The posterior distribution in a nonparametric inverse problem is
shown to contract to the true parameter at a rate that depends on the 
smoothness of the parameter, and the smoothness and scale of the 
prior. Correct combinations of these characteristics lead to the mini- 
max rate. The frequentist coverage of credible sets is shown to depend 
on the combination of prior and true parameter, with smoother priors 
leading to zero coverage and rougher priors to conservative coverage. 
In the latter case credible sets are of the correct order of magnitude. 
The results are numerically illustrated by the problem of recovering 
a function from observation of a noisy version of its primitive. 

1. Introduction. In this paper we study a Bayesian approach to estimating
a parameter $\mu$ from an observation $Y$ following the model

(1.1) $Y = K\mu + \frac{1}{\sqrt{n}} Z.$



The unknown parameter $\mu$ is an element of a separable Hilbert space $H_1$, and
is mapped into another Hilbert space $H_2$ by a known, injective, continuous
linear operator $K : H_1 \to H_2$. The image $K\mu$ is perturbed by unobserved,
scaled Gaussian white noise Z. There are many special examples of this 
infinite-dimensional regression model, which can also be viewed as an ide- 
alized version of other statistical models, including density estimation. The 
inverse problem of estimating $\mu$ has been studied by both statisticians and
numerical mathematicians (see, e.g., [3, 6, 8, 24, 26, 33] for reviews), but 
rarely from a theoretical Bayesian perspective; exceptions are [7] and [11]. 

The Bayesian approach to (1.1) consists of putting a prior on the parameter
$\mu$ and computing the posterior distribution. We study Gaussian
priors, which are conjugate to the model, so that the posterior distribution
is also Gaussian and easy to derive. Our interest is in studying the proper- 
ties of this posterior distribution, under the "frequentist" assumption that 
the data $Y$ has been generated according to the model (1.1) with a given
"true" parameter $\mu_0$. We investigate whether and at what rate the posterior
distributions contract to $\mu_0$ as $n \to \infty$ (as in [15]), but have as main interest
the performance of credible sets for measuring the uncertainty about the 
parameter. 

A Bayesian credible set is defined as a central region in the posterior 
distribution of specified posterior probability, for instance, 95%. As a conse- 
quence of the Bernstein-von Mises theorem credible sets for smooth finite- 
dimensional parametric models are asymptotically equivalent to confidence 
regions based on the maximum likelihood estimator (see, e.g., [31], Chap- 
ter 10), under mild conditions on the prior. Thus, "Bayesian uncertainty" 
is equivalent to "frequentist uncertainty" in these cases, at least for large n. 
However, there is no corresponding Bernstein-von Mises theorem in nonparametric
Bayesian inference, as noted in [12]. The performance of Bayesian
credible sets in these situations has received little attention, although in 
practice such sets are typically provided as indicators of uncertainty, for 
instance, based on the spread of the output of a (converged) MCMC run. 
The paper [7] did tackle this issue and came to the alarming conclusion 
that Bayesian credible sets have frequentist coverage zero. If this were true, 
many data analysts would (justifiably) distrust the spread in the posterior 
distribution as a measure of uncertainty. For other results see [4, 13, 14] 
and [18]. 

The model considered in [7] is equivalent to our model (1.1), and a good 
starting point for studying these issues. More precisely, the conclusion of [7] 
is that for almost every parameter $\mu_0$ from the prior the coverage of a credible
set (of any level) is 0. In the present paper we show that this is only
part of the story, and, taken by itself, the conclusion is misleading. The coverage
depends on the true parameter $\mu_0$ and the prior together, and it can
be understood in terms of a bias-variance trade-off, much as the coverage 
of frequentist nonparametric procedures. A nonparametric procedure that 
oversmooths the truth (too big a bandwidth in a frequentist procedure, or
a prior that puts too much weight on "smooth" parameters) will be biased, 
and a confidence or credible region based on such a procedure will be both 
too concentrated and wrongly located, giving zero coverage. On the other 
hand, undersmoothing does work (to a certain extent), also in the Bayesian 
setup, as we show below. In this light we reinterpret the conclusion of [7] 
to be valid only in the oversmoothed case (notwithstanding a conjecture 
to the contrary in the Introduction of [7]; see page 905, answer to
objection 4). In the undersmoothed case credible regions are conservative in 
general, with coverage tending to 1. The good news is that typically they
are of the correct order of magnitude, so that they do give a reasonable idea
of the uncertainty in the estimate. 

Of course, whether a prior under- or oversmooths depends on the regularity
of the true parameter. In practice, we may not want to consider this
known, and adapt the prior smoothness to the data. In this paper we do 
consider the effect of changing the "length scale" of a prior, but do not 
study data-dependent length scales. The effect of setting the latter by, for 
example, an empirical or full Bayes method will require further study. 

Credible sets are by definition "central regions" in the posterior distri- 
bution. Because the posterior distribution is a random probability measure 
on the Hilbert space $H_1$, a "central ball" is a natural shape of such a set,
but it has the disadvantage that it is difficult to visualize. If the Hilbert 
space is a function space, then credible bands are more natural. These cor- 
respond to simultaneous credible intervals for the function at a point, and 
can be obtained from the (marginal) posterior distributions of a set of lin- 
ear functionals. Besides the full posterior distribution, we therefore study 
its marginals for linear functionals. The same issue of the dependence of 
coverage on under- and oversmoothing arises, except that "very smooth" 
linear functionals cancel the inverse nature of the problem, and do allow 
a Bernstein-von Mises theorem for a large set of priors. Unfortunately, point
evaluations are usually not smooth in this sense.

Thus, we study two aspects of inverse problems: recovering the full parameter
$\mu$ (Section 4) and recovering linear functionals (Section 5). We obtain
the rate of contraction of the posterior distribution in both settings, in
its dependence on parameters of the prior. Furthermore, and most importantly,
we study the "frequentist" coverage of credible regions for $\mu$ in both
settings, for the same set of priors. In the next section we give a more precise 
statement of the problem, and in Section 3 we describe the priors that we 
consider and derive the corresponding posterior distributions. In Section 6 
we illustrate the results by simulations and pictures in the particular exam- 
ple that K is the Volterra operator. Technical proofs are placed in Sections 7 
and 8 at the end of the paper. 

Throughout the paper $\langle \cdot, \cdot \rangle_1$ and $\|\cdot\|_1$, and $\langle \cdot, \cdot \rangle_2$ and $\|\cdot\|_2$ denote the
inner products and norms of the Hilbert spaces $H_1$ and $H_2$. The adjoint of
an operator $A$ between two Hilbert spaces is denoted by $A^T$. The Sobolev
space $S^\beta$ with its norm $\|\cdot\|_\beta$ is defined in (2.2). For two sequences $(a_n)$
and $(b_n)$ of numbers, $a_n \asymp b_n$ means that $|a_n/b_n|$ is bounded away from zero
and infinity as $n \to \infty$, and $a_n \lesssim b_n$ means that $a_n/b_n$ is bounded.

2. Detailed description of the problem. The noise process $Z$ in (1.1)
is the standard normal or iso-Gaussian process for the Hilbert space $H_2$.
Because this is not realizable as a random element in $H_2$, the model (1.1) is
interpreted in process form (as in [3]). The iso-Gaussian process is the zero-mean
Gaussian process $Z = (Z_h : h \in H_2)$ with covariance function
$\mathrm{E} Z_h Z_{h'} = \langle h, h' \rangle_2$, and the measurement equation (1.1) is interpreted in that we observe
a Gaussian process $Y = (Y_h : h \in H_2)$ with mean and covariance functions

(2.1) $\mathrm{E} Y_h = \langle K\mu, h \rangle_2, \qquad \operatorname{cov}(Y_h, Y_{h'}) = \frac{1}{n} \langle h, h' \rangle_2.$

Sufficiency considerations show that it is statistically equivalent to observe
the subprocess $(Y_{h_i} : i \in \mathbb{N})$, for any orthonormal basis $h_1, h_2, \ldots$ of $H_2$.

If the operator $K$ is compact, then the spectral decomposition of the self-adjoint
operator $K^T K : H_1 \to H_1$ provides a convenient basis. In the compact
case the operator $K^T K$ possesses countably many positive eigenvalues $\kappa_i^2$
and there is a corresponding orthonormal basis $(e_i)$ of $H_1$ of eigenfunctions
(hence, $K^T K e_i = \kappa_i^2 e_i$ for $i \in \mathbb{N}$; see, e.g., [23]). The sequence $(f_i)$ defined by
$K e_i = \kappa_i f_i$ forms an orthonormal "conjugate" basis of the range of $K$ in $H_2$.
An element $\mu \in H_1$ can be identified with its sequence $(\mu_i)$ of coordinates
relative to the eigenbasis $(e_i)$, and its image $K\mu = \sum_i \mu_i K e_i = \sum_i \mu_i \kappa_i f_i$
can be identified with its coordinates $(\mu_i \kappa_i)$ relative to the conjugate basis
$(f_i)$. If we write $Y_i$ for $Y_{f_i}$, then (2.1) shows that $Y_1, Y_2, \ldots$ are independent
Gaussian variables with means $\mathrm{E} Y_i = \mu_i \kappa_i$ and variance $1/n$. Therefore,
a concrete equivalent description of the statistical problem is to recover the
sequence $(\mu_i)$ from independent observations $Y_1, Y_2, \ldots$ with
$N(\mu_i \kappa_i, 1/n)$-distributions.
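
As a concrete illustration of this sequence formulation, the following minimal sketch simulates the observations $Y_i$. It assumes a truncation to finitely many coordinates and polynomially decaying choices of $(\kappa_i)$ and $(\mu_i)$; the function name `simulate_sequence_model` and all numerical values are hypothetical and not taken from the paper.

```python
import numpy as np

def simulate_sequence_model(mu, kappa, n, rng):
    """Draw Y_i ~ N(kappa_i * mu_i, 1/n), independently over the coordinates i."""
    mu = np.asarray(mu, dtype=float)
    kappa = np.asarray(kappa, dtype=float)
    noise = rng.standard_normal(mu.shape) / np.sqrt(n)  # variance 1/n per coordinate
    return kappa * mu + noise

# Hypothetical example: 200 coordinates, kappa_i = 1/i, mu_i = i^{-3/2}.
rng = np.random.default_rng(0)
i = np.arange(1, 201, dtype=float)
kappa = i ** -1.0
mu = i ** -1.5
y = simulate_sequence_model(mu, kappa, n=1000, rng=rng)
```

Dividing $Y_i$ by $\kappa_i$ gives an unbiased but increasingly noisy estimate of $\mu_i$ as $i$ grows, which is the ill-posedness discussed next.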

In the following we do not require K to be compact, but we do assume 
the existence of an orthonormal basis of eigenfunctions of $K^T K$. The main
additional example we then cover is the white noise model, in which K is 
the identity operator. The description of the problem remains the same. 

If $\kappa_i \to 0$, this problem is ill-posed, and the recovery of $\mu$ from $Y$ an
inverse problem. The ill-posedness can be quantified by the speed of decay
$\kappa_i \downarrow 0$. Although the tools are more widely applicable, we limit ourselves to
the mildly ill-posed problem (in the terminology of [6]) and assume that the
decay is polynomial: for some $p > 0$,

$\kappa_i \asymp i^{-p}.$

Estimation of $\mu$ is harder if the decay is faster (i.e., $p$ is larger).

The difficulty of estimation may be measured by the minimax risks over
the scale of Sobolev spaces relative to the orthonormal basis $(e_i)$ of eigenfunctions
of $K^T K$. For $\beta > 0$ define

(2.2) $\|\mu\|_\beta^2 = \sum_{i=1}^\infty \mu_i^2 i^{2\beta}.$

Then the Sobolev space of order $\beta$ is $S^\beta = \{\mu \in H_1 : \|\mu\|_\beta < \infty\}$. The minimax
rate of estimation over the unit ball of this space relative to the loss
$\|t - \mu\|_1$ of an estimate $t$ for $\mu$ is $n^{-\beta/(1+2\beta+2p)}$. This rate is attained by various
"regularization" methods, such as generalized Tikhonov and Moore-Penrose
regularization [1, 3, 6, 16, 19]. The Bayesian approach is closely
connected to these methods: in Section 3 the posterior mean is shown to be 
a regularized estimator. 

Besides recovery of the full parameter $\mu$, we consider estimating linear
functionals $L\mu$. The minimax rate for such functionals over Sobolev balls
depends on $L$ as well as on the parameter of the Sobolev space. Decay of
the coefficients of $L$ in the eigenbasis may alleviate the level of ill-posedness,
with rapid decay even bringing the functional in the domain of "regular"
$n^{-1/2}$-rate estimation.

3. Prior and posterior distributions. We assume a mean-zero Gaussian 
prior for the parameter $\mu$. In the next three paragraphs we recall some
essential facts on Gaussian distributions on Hilbert spaces. 

A Gaussian distribution $N(\nu, \Lambda)$ on the Borel sets of the Hilbert space $H_1$
is characterized by a mean $\nu$, which can be any element of $H_1$, and a covariance
operator $\Lambda : H_1 \to H_1$, which is a nonnegative-definite, self-adjoint,
linear operator of trace class: a compact operator with eigenvalues $(\lambda_i)$ that
are summable, $\sum_{i=1}^\infty \lambda_i < \infty$ (see, e.g., [25], pages 18-20). A random element
$G$ in $H_1$ is $N(\nu, \Lambda)$-distributed if and only if the stochastic process
$(\langle G, h \rangle_1 : h \in H_1)$ is a Gaussian process with mean and covariance functions

(3.1) $\mathrm{E} \langle G, h \rangle_1 = \langle \nu, h \rangle_1, \qquad \operatorname{cov}(\langle G, h \rangle_1, \langle G, h' \rangle_1) = \langle h, \Lambda h' \rangle_1.$

The coefficients $G_i = \langle G, \varphi_i \rangle_1$ of $G$ relative to an orthonormal eigenbasis $(\varphi_i)$
of $\Lambda$ are independent, univariate Gaussians with means the coordinates $(\nu_i)$
of the mean vector $\nu$ and variances the eigenvalues $\lambda_i$.

The iso-Gaussian process $Z$ in (1.1) may be thought of as a $N(0, I)$-distributed
Gaussian element, for $I$ the identity operator (on $H_2$), but as $I$
is not of trace class, this distribution is not realizable as a proper random
element in $H_2$. Similarly, the data $Y$ in (1.1) can be described as having
a $N(K\mu, n^{-1} I)$-distribution.

For a stochastic process $W = (W_h : h \in H_2)$ and a continuous, linear operator
$A : H_2 \to H_1$, we define the transformation $AW$ as the stochastic process
with coordinates $(AW)_h = W_{A^T h}$, for $h \in H_1$. If the process $W$ arises as
$W_h = \langle W, h \rangle_2$ from a random element $W$ in the Hilbert space $H_2$, then this
definition is consistent with identifying the random element $AW$ in $H_1$ with
the process $(\langle AW, h \rangle_1 : h \in H_1)$, as in (3.1) with $G = AW$. Furthermore, if $A$
is a Hilbert-Schmidt operator (i.e., $AA^T$ is of trace class), and $W = Z$ is
the iso-Gaussian process, then the process $AW$ can be realized as a random
variable in $H_1$ with a $N(0, AA^T)$-distribution.

In the Bayesian setup the prior, which we take to be $N(0, \Lambda)$, is the marginal
distribution of $\mu$, and the noise $Z$ in (1.1) is considered independent of $\mu$.



6 B. T. KNAPIK, A. W. VAN DER VAART AND J. H. VAN ZANTEN 

The joint distribution of $(Y, \mu)$ is then also Gaussian, and so is the conditional
distribution of $\mu$ given $Y$, the posterior distribution of $\mu$. In general,
one must be a bit careful with manipulating possibly "improper" Gaussian
distributions (see [20]), but in our situation the posterior is a proper Gaussian
conditional distribution on $H_1$.

Proposition 3.1 (Full posterior). If $\mu$ is $N(0, \Lambda)$-distributed and $Y$
given $\mu$ is $N(K\mu, n^{-1} I)$-distributed, then the conditional distribution of $\mu$
given $Y$ is Gaussian $N(AY, S_n)$ on $H_1$, where

(3.2) $S_n = \Lambda - A(n^{-1} I + K \Lambda K^T) A^T,$

and $A : H_2 \to H_1$ is the continuous linear operator

(3.3) $A = \Lambda^{1/2} \Bigl( \frac{1}{n} I + \Lambda^{1/2} K^T K \Lambda^{1/2} \Bigr)^{-1} \Lambda^{1/2} K^T = \Lambda K^T \Bigl( \frac{1}{n} I + K \Lambda K^T \Bigr)^{-1}.$

The posterior distribution is proper (i.e., $S_n$ has finite trace) and equivalent
(in the sense of absolute continuity) to the prior.

Proof. Identity (3.3) is a special case of the identity $(I + BB^T)^{-1} B =
B(I + B^T B)^{-1}$, which is valid for any compact, linear operator $B : H_1 \to H_2$.
That $S_n$ is of trace class is a consequence of the fact that it is bounded
above by $\Lambda$ (i.e., $\Lambda - S_n$ is nonnegative definite), which is of trace class by
assumption.

The operator $\Lambda^{1/2} K^T K \Lambda^{1/2} : H_1 \to H_1$ has trace bounded by $\|K^T K\| \operatorname{tr}(\Lambda)$
and hence is of trace class. It follows that the variable $\Lambda^{1/2} K^T Z$ can be defined
as a random element in the Hilbert space $H_1$, and so can $AY$, for $A$
given by the first expression in (3.3). The joint distribution of $(Y, \mu)$ is Gaussian
with zero mean and covariance operator

$\begin{pmatrix} n^{-1} I + K \Lambda K^T & K \Lambda \\ \Lambda K^T & \Lambda \end{pmatrix}.$

Using this with the second form of $A$ in (3.3), we can check that the cross covariance
operator of the variables $\mu - AY$ and $Y$ (the latter viewed as a Gaussian
stochastic process in $\mathbb{R}^{H_2}$) vanishes and, hence, these variables are independent.
Thus, the two terms in the decomposition $\mu = (\mu - AY) + AY$
are conditionally independent and degenerate given $Y$, respectively. The
distribution of $\mu - AY$ is zero-mean Gaussian with covariance operator
$\operatorname{Cov}(\mu - AY) = \operatorname{Cov}(\mu) - \operatorname{Cov}(AY)$, by the independence of $\mu - AY$ and $AY$.
This gives the form of the posterior distribution.

The final assertion may be proved by explicitly comparing the Gaussian
prior and posterior. It is easier to note that it suffices to show that the model
consisting of all $N(K\mu, n^{-1} I)$-distributions is dominated. In that case the
posterior can be obtained using Bayes' rule, which reveals the normalized
likelihood as a density relative to the (in fact, any) prior. To prove domination,
we may consider equivalently the distributions $\bigotimes_{i=1}^\infty N(\kappa_i \mu_i, n^{-1})$
on $\mathbb{R}^\infty$ of the sufficient statistic $(Y_i)$ defined as the coordinates of $Y$ relative
to the conjugate spectral basis. These distributions, for $(\mu_i) \in \ell_2$, are
equivalent to the distribution $\bigotimes_{i=1}^\infty N(0, n^{-1})$, as can be seen with the help
of Kakutani's theorem, the affinity being $\exp(-\sum_i \kappa_i^2 \mu_i^2 / 8) > 0$. (This argument
actually proves the well-known fact that the Gaussian shift experiment
obtained by translating the standard normal distribution on $\mathbb{R}^\infty$ over
its RKHS $\ell_2$ is dominated.) $\square$
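
The operator identities in Proposition 3.1 can be checked numerically in a finite-dimensional analogue. The sketch below is only an illustration under stated assumptions: $H_1 = \mathbb{R}^4$, $H_2 = \mathbb{R}^6$, a randomly drawn matrix `K` and an invertible diagonal `Lam`, all hypothetical choices. In this finite-dimensional, invertible setting $S_n$ also coincides with the familiar precision form $(\Lambda^{-1} + n K^T K)^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(1)
d1, d2, n = 4, 6, 50                      # hypothetical dimensions and noise level
K = rng.standard_normal((d2, d1))         # stand-in for the operator K: H_1 -> H_2
lam = 1.0 / np.arange(1, d1 + 1) ** 2     # prior eigenvalues (assumed values)
Lam = np.diag(lam)
Lam_half = np.diag(np.sqrt(lam))

# First expression for A in (3.3).
A_first = (Lam_half
           @ np.linalg.inv(np.eye(d1) / n + Lam_half @ K.T @ K @ Lam_half)
           @ Lam_half @ K.T)
# Second expression for A in (3.3).
A_second = Lam @ K.T @ np.linalg.inv(np.eye(d2) / n + K @ Lam @ K.T)

# Posterior covariance (3.2) and the equivalent precision form.
S_n = Lam - A_second @ (np.eye(d2) / n + K @ Lam @ K.T) @ A_second.T
S_n_precision = np.linalg.inv(np.linalg.inv(Lam) + n * K.T @ K)
```

Both expressions for $A$ agree up to rounding, and $\Lambda - S_n$ is nonnegative definite, which is the finite-dimensional counterpart of the trace-class argument in the proof.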

In the remainder of the paper we study the asymptotic behavior of the
posterior distribution, under the assumption that $Y = K\mu_0 + n^{-1/2} Z$ for
a fixed $\mu_0 \in H_1$. The posterior is characterized by its center $AY$, the posterior
mean, and its spread, the posterior covariance operator $S_n$. The first
depends on the data, but the second is deterministic. From a frequentist-Bayes
perspective both are important: one would like the posterior mean to
give a good estimate for $\mu_0$, and the spread to give a good indication of the
uncertainty in this estimate.

The posterior mean is a regularization, of the Tikhonov type, of the naive
estimator $K^{-1} Y$. It can also be characterized as a penalized least squares
estimator (see [21, 27]): it minimizes the functional

$\mu \mapsto \|Y - K\mu\|_2^2 + \frac{1}{n} \|\Lambda^{-1/2} \mu\|_1^2.$

The penalty $\|\Lambda^{-1/2} \mu\|_1$ is interpreted as $\infty$ if $\mu$ is not in the range of $\Lambda^{1/2}$.
Because this range is precisely the reproducing kernel Hilbert space (RKHS)
of the prior (cf. [32]), with $\|\Lambda^{-1/2} \mu\|_1$ as the RKHS-norm of $\mu$, the posterior
mean also fits into the general regularization framework using RKHS-norms
(see [22]). In any case the posterior mean is a well-studied point estimator in
the literature on inverse problems. In this paper we add a Bayesian interpretation
to it, and are (more) concerned with the full posterior distribution.

Next consider the posterior distribution of a linear functional $L\mu$ of the
parameter. We are not only interested in continuous, linear functionals $L\mu =
\langle \mu, l \rangle_1$, for some given $l \in H_1$, but also in certain discontinuous functionals,
such as point evaluation in a Hilbert space of functions. The latter entail
some technicalities. We consider measurable linear functionals relative to
the prior $N(0, \Lambda)$, defined in [25], pages 27-29, as Borel measurable maps
$L : H_1 \to \mathbb{R}$ that are linear on a measurable linear subspace $\underline{H}_1 \subset H_1$ such
that $N(0, \Lambda)(\underline{H}_1) = 1$. This definition is exactly right to make the marginal
posterior Gaussian.

Proposition 3.2 (Marginal posterior). If $\mu$ is $N(0, \Lambda)$-distributed and $Y$
given $\mu$ is $N(K\mu, n^{-1} I)$-distributed, then the conditional distribution of $L\mu$
given $Y$ for a $N(0, \Lambda)$-measurable linear functional $L : H_1 \to \mathbb{R}$ is a Gaussian
distribution $N(LAY, s_n^2)$ on $\mathbb{R}$, where

(3.4) $s_n^2 = (L\Lambda^{1/2})(L\Lambda^{1/2})^T - LA(n^{-1} I + K \Lambda K^T)(LA)^T,$

and $A : H_2 \to H_1$ is the continuous linear operator defined in (3.3).

Proof. As in the proof of Proposition 3.1, the first term in the decomposition
$L\mu = L(\mu - AY) + LAY$ is independent of $Y$. Therefore, the
posterior distribution is the marginal distribution of $L(\mu - AY)$ shifted by
$LAY$. It suffices to show that this marginal distribution is $N(0, s_n^2)$.

By Theorem 1 on page 28 in [25], there exists a sequence of continuous
linear maps $L_m : H_1 \to \mathbb{R}$ such that $L_m h \to Lh$ for all $h$ in a set with probability
one under the prior $\Pi = N(0, \Lambda)$. This implies that $L_m \Lambda^{1/2} h \to L \Lambda^{1/2} h$
for every $h \in H_1$. Indeed, if $V = \{h \in H_1 : L_m h \to Lh\}$ and $g \notin V$, then
$V_1 := V + g$ and $V$ are disjoint measurable, affine subspaces of $H_1$, where
$\Pi(V) = 1$. The range of $\Lambda^{1/2}$ is the RKHS of $\Pi$ and, hence, if $g$ is in this
range, then $\Pi(V_1) > 0$, as $\Pi$ shifted over an element from its RKHS is equivalent
to $\Pi$. But then $V$ and $V_1$ are not disjoint.

Therefore, from the first definition of $A$ in (3.3) we see that $L_m A \to LA$,
and, hence, $L_m(\mu - AY) \to L(\mu - AY)$, almost surely. As $L_m$ is continuous,
the variable $L_m(\mu - AY)$ is normally distributed with mean zero and
variance $L_m S_n L_m^T = (L_m \Lambda^{1/2})(L_m \Lambda^{1/2})^T - L_m A(n^{-1} I + K \Lambda K^T)(L_m A)^T$,
for $S_n$ given by (3.2). The desired result follows upon taking the limit as
$m \to \infty$. $\square$

As shown in the preceding proof, $N(0, \Lambda)$-measurable linear functionals $L$
automatically have the further property that $L\Lambda^{1/2} : H_1 \to \mathbb{R}$ is a continuous
linear map. This shows that $LA$ and the adjoint operators $(L\Lambda^{1/2})^T$ and
$(LA)^T$ are well defined, so that the formula for $s_n^2$ makes sense. If $L$ is
a continuous linear operator, one can also write these adjoints in terms
of the adjoint $L^T$ of $L$, and express $s_n^2$ in the covariance operator $S_n$ of
Proposition 3.1 as $s_n^2 = L S_n L^T$. This is exactly as expected.

In the remainder of the paper we study the full posterior distribution
$N(AY, S_n)$, and its marginals $N(LAY, s_n^2)$. We are particularly interested in
the influence of the prior on the performance of the posterior distribution
for various true parameters $\mu_0$. We study this in the following setting.

Assumption 3.1. The operators $K^T K$ and $\Lambda$ have the same eigenfunctions
$(e_i)$, with eigenvalues $(\kappa_i^2)$ and $(\lambda_i)$, satisfying

(3.5) $\lambda_i = \tau_n^2 i^{-1-2\alpha}, \qquad C^{-1} i^{-p} < \kappa_i < C i^{-p}$

for some $\alpha > 0$, $p > 0$, $C > 1$ and $\tau_n > 0$ such that $n \tau_n^2 \to \infty$. Furthermore,
the true parameter $\mu_0$ belongs to $S^\beta$ for some $\beta > 0$: that is, its coordinates
$(\mu_{0,i})$ relative to $(e_i)$ satisfy $\sum_{i=1}^\infty \mu_{0,i}^2 i^{2\beta} < \infty$.

The setting of Assumption 3.1 is a Bayesian extension of the mildly ill-posed
inverse problem (cf. [6]). We refer to the parameter $\beta$ as the "regularity"
of the true parameter $\mu_0$. In the special case that $H_1$ is a function
space and $(e_i)$ its Fourier basis, this parameter gives smoothness of $\mu_0$ in
the classical Sobolev sense. Because the coefficients $(\mu_i)$ of the prior parameter
$\mu$ are normally $N(0, \lambda_i)$-distributed, under Assumption 3.1 we have
$\mathrm{E} \sum_i i^{2\alpha'} \mu_i^2 = \sum_i i^{2\alpha'} \lambda_i = \tau_n^2 \sum_i i^{2\alpha' - 1 - 2\alpha} < \infty$ if and only if $\alpha' < \alpha$. Thus, $\alpha$ is "almost"
the smoothness of the parameters generated by the prior. This smoothness
is modified by the scaling factor $\tau_n$. Although this leaves the relative sizes
of the coefficients $\mu_i$, and hence the qualitative smoothness of the prior, invariant,
we shall see that scaling can completely alter the performance of
the Bayesian procedure. Rates $\tau_n \downarrow 0$ increase, and rates $\tau_n \uparrow \infty$ decrease the
regularity.
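
Under Assumption 3.1 the operators in (3.2) and (3.3) are all diagonal in the eigenbasis $(e_i)$, so the posterior factorizes over coordinates: the $i$th coordinate of $\mu$ given $Y$ is normal with mean $n \lambda_i \kappa_i Y_i / (1 + n \lambda_i \kappa_i^2)$ and variance $\lambda_i / (1 + n \lambda_i \kappa_i^2)$. The sketch below evaluates these formulas. It assumes $\kappa_i = i^{-p}$ exactly and a truncation to 100 coordinates; the function names and numerical values are hypothetical.

```python
import numpy as np

def prior_eigenvalues(i, alpha, tau=1.0):
    """Prior variances lambda_i = tau^2 * i^(-1 - 2 alpha), as in (3.5)."""
    return tau**2 * i ** (-1.0 - 2.0 * alpha)

def posterior_coordinates(y, lam, kappa, n):
    """Coordinatewise posterior mean and variance for diagonal K^T K and Lambda."""
    denom = 1.0 + n * lam * kappa**2
    post_mean = (n * lam * kappa / denom) * y   # i-th coordinate of AY
    post_var = lam / denom                      # i-th eigenvalue of S_n
    return post_mean, post_var

# Hypothetical example with p = 1, alpha = 1 and a truth mu0_i = i^(-2).
i = np.arange(1, 101, dtype=float)
kappa = i ** -1.0
lam = prior_eigenvalues(i, alpha=1.0)
n = 1000
rng = np.random.default_rng(0)
mu0 = i ** -2.0
y = kappa * mu0 + rng.standard_normal(i.size) / np.sqrt(n)
post_mean, post_var = posterior_coordinates(y, lam, kappa, n)
```

The posterior variances do not depend on the data, and the posterior mean shrinks the naive estimate $Y_i / \kappa_i$ toward zero, more strongly for large $i$.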

4. Recovering the full parameter. We denote by $\Pi_n(\cdot \mid Y)$ the posterior
distribution $N(AY, S_n)$, derived in Proposition 3.1. Our first theorem shows
that it contracts as $n \to \infty$ to the true parameter at a rate $\varepsilon_n$ that depends
on all four parameters $\alpha, \beta, \tau_n, p$ of the (Bayesian) inverse problem.

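Theorem 4.1 (Contraction). If $\mu_0$, $(\lambda_i)$, $(\kappa_i)$ and $(\tau_n)$ are as in Assumption 3.1,
then $\mathrm{E}_{\mu_0} \Pi_n(\mu : \|\mu - \mu_0\|_1 > M_n \varepsilon_n \mid Y) \to 0$, for every $M_n \to \infty$,
where

(4.1) $\varepsilon_n = (n \tau_n^2)^{-\beta/(1+2\alpha+2p) \wedge 1} + \tau_n (n \tau_n^2)^{-\alpha/(1+2\alpha+2p)}.$

The rate is uniform over $\mu_0$ in balls in $S^\beta$. In particular:

(i) If $\tau_n \equiv 1$, then $\varepsilon_n = n^{-(\alpha \wedge \beta)/(1+2\alpha+2p)}$.

(ii) If $\beta < 1 + 2\alpha + 2p$ and $\tau_n \asymp n^{(\alpha-\beta)/(1+2\beta+2p)}$, then $\varepsilon_n = n^{-\beta/(1+2\beta+2p)}$.

(iii) If $\beta > 1 + 2\alpha + 2p$, then $\varepsilon_n \gg n^{-\beta/(1+2\beta+2p)}$, for every scaling $\tau_n$.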

The minimax rate of convergence over a Sobolev ball $S^\beta$ is of the order
$n^{-\beta/(1+2\beta+2p)}$ (see [6]). By (i) of the theorem the posterior contraction rate
is the same if the regularity of the prior is chosen to match the regularity of
the truth ($\alpha = \beta$) and the scale $\tau_n$ is fixed. Alternatively, the optimal rate
is also attained by appropriately scaling ($\tau_n \asymp n^{(\alpha-\beta)/(1+2\beta+2p)}$, determined
by balancing the two terms in $\varepsilon_n$) a prior that is regular enough ($\beta < 1 +
2\alpha + 2p$). In all other cases (no scaling and $\alpha \neq \beta$, or any scaling combined
with a rough prior $\beta > 1 + 2\alpha + 2p$), the contraction rate is slower than the
minimax rate.
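
The comparison between the contraction rate (4.1) and the minimax rate can be made concrete numerically. The following sketch evaluates (4.1) with all constants set to one; the helper names `eps_n`, `minimax_rate` and `optimal_scaling` and the parameter values are hypothetical choices for illustration.

```python
import math

def eps_n(n, tau, alpha, beta, p):
    """Rate (4.1), with constants ignored."""
    d = 1.0 + 2.0 * alpha + 2.0 * p
    n_tau2 = n * tau**2
    return n_tau2 ** (-min(beta / d, 1.0)) + tau * n_tau2 ** (-alpha / d)

def minimax_rate(n, beta, p):
    """Minimax rate n^(-beta / (1 + 2 beta + 2 p)) over a Sobolev ball."""
    return n ** (-beta / (1.0 + 2.0 * beta + 2.0 * p))

def optimal_scaling(n, alpha, beta, p):
    """Scaling tau_n of Theorem 4.1(ii), which balances the two terms of (4.1)."""
    return n ** ((alpha - beta) / (1.0 + 2.0 * beta + 2.0 * p))

# Hypothetical setting: prior smoother than the truth (alpha = 2 > beta = 1), p = 1.
n, alpha, beta, p = 1e6, 2.0, 1.0, 1.0
unscaled = eps_n(n, 1.0, alpha, beta, p)
scaled = eps_n(n, optimal_scaling(n, alpha, beta, p), alpha, beta, p)
target = minimax_rate(n, beta, p)
balanced = math.isclose(scaled, 2.0 * target)  # both terms of (4.1) equal the target
```

With the optimal scaling both terms of (4.1) equal the minimax rate, whereas the unscaled oversmoothing prior gives the slower rate $n^{-\beta/(1+2\alpha+2p)}$.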

That "correct" specification of the prior gives the optimal rate is comforting
to the true Bayesian. Perhaps the main message of the theorem is
that even if the prior mismatches the truth, it may be scalable to give the
optimal rate. Here, similarly to what was found by [29] in a different setting, a smooth
prior can be scaled to make it "rougher" to any degree, but a rough prior
can be "smoothed" relatively little (namely, from $\alpha$ to any $\beta < 1 + 2\alpha + 2p$).
It will be of interest to investigate a full or empirical Bayesian approach to
set the scaling parameter.

Bayesian inference takes the spread in the posterior distribution as an 
expression of uncertainty. This practice is not validated by (fast) contrac- 
tion of the posterior. Instead we consider the frequentist coverage of credible 
sets. As the posterior distribution is Gaussian, it is natural to center a cred- 
ible region at the posterior mean. Different shapes of such a set could be 
considered. The natural counterpart of the preceding theorem is to consider 
balls. In the next section we also consider bands. (Alternatively, one might 
consider ellipsoids, depending on geometry of the support of the posterior.) 

Because the posterior spread $S_n$ is deterministic, the radius is the only
degree of freedom when we choose a ball, and we fix it by the desired "credibility
level" $1 - \gamma \in (0, 1)$. A credible ball centered at the posterior mean $AY$
takes the form, where $B(r)$ denotes a ball of radius $r$ around 0,

(4.2) $AY + B(r_{n,\gamma}) := \{\mu \in H_1 : \|\mu - AY\|_1 < r_{n,\gamma}\},$

where the radius $r_{n,\gamma}$ is determined so that

(4.3) $\Pi_n(AY + B(r_{n,\gamma}) \mid Y) = 1 - \gamma.$

Because the posterior spread $S_n$ is not dependent on the data, neither is the
radius $r_{n,\gamma}$. The frequentist coverage or confidence of the set (4.2) is

(4.4) $P_{\mu_0}(\mu_0 \in AY + B(r_{n,\gamma})),$

where under the probability measure $P_{\mu_0}$ the variable $Y$ follows (1.1) with
$\mu = \mu_0$. We shall consider the coverage as $n \to \infty$ for fixed $\mu_0$, uniformly in
Sobolev balls, and also along sequences $\mu_0^n$ that change with $n$.

The following theorem shows that the relation of the coverage to the
credibility level $1 - \gamma$ is mediated by all parameters of the problem. For
further insight, the credible region is also compared to the "correct" frequentist
confidence ball $AY + B(\tilde r_{n,\gamma})$, which has radius $\tilde r_{n,\gamma}$ chosen so that
the probability in (4.4) with $r_{n,\gamma}$ replaced by $\tilde r_{n,\gamma}$ is equal to $1 - \gamma$.

Theorem 4.2 (Credibility). Let $\mu_0$, $(\lambda_i)$, $(\kappa_i)$ and $\tau_n$ be as in Assumption 3.1,
and set $\tilde\beta = \beta \wedge (1 + 2\alpha + 2p)$. The asymptotic coverage of the
credible region (4.2) is:

(i) 1, uniformly in $\mu_0$ with $\|\mu_0\|_\beta \le 1$, if $\tau_n \gg n^{(\alpha-\tilde\beta)/(1+2\tilde\beta+2p)}$; in this
case $\tilde r_{n,\gamma} \asymp r_{n,\gamma}$.

(ii) 1, for every fixed $\mu_0 \in S^\beta$, if $\beta < 1 + 2\alpha + 2p$ and $\tau_n \asymp n^{(\alpha-\tilde\beta)/(1+2\tilde\beta+2p)}$;
$c$, along some $\mu_0^n$ with $\sup_n \|\mu_0^n\|_\beta < \infty$, if $\tau_n \asymp n^{(\alpha-\tilde\beta)/(1+2\tilde\beta+2p)}$ (any $c \in [0, 1)$).

(iii) 0, along some $\mu_0^n$ with $\sup_n \|\mu_0^n\|_\beta < \infty$, if $\tau_n \ll n^{(\alpha-\tilde\beta)/(1+2\tilde\beta+2p)}$.

If $\tau_n \equiv 1$, then the cases (i), (ii) and (iii) arise if $\alpha < \beta$, $\alpha = \beta$ and $\alpha > \beta$,
respectively. In case (iii) the sequence $\mu_0^n$ can then be chosen fixed.

The theorem is easiest to interpret in the situation without scaling ($\tau_n \equiv 1$).
Then oversmoothing the prior [case (iii): $\alpha > \beta$] has disastrous consequences
for the coverage of the credible sets, whereas undersmoothing [case (i): $\alpha < \beta$]
leads to conservative confidence sets. Choosing a prior of correct regularity
[case (ii): $\alpha = \beta$] gives mixed results.
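
The contrast between under- and oversmoothing can be explored by simulation in the sequence model. The sketch below estimates the coverage (4.4) of the credible ball (4.2) by Monte Carlo. It is an illustration under assumptions that are not from the paper: $\kappa_i = i^{-p}$ exactly, a truncation to 500 coordinates, a Monte Carlo approximation of the radius $r_{n,\gamma}$, and a hypothetical truth $\mu_{0,i} = i^{-1.6}$ (which lies in $S^\beta$ for every $\beta < 1.1$).

```python
import numpy as np

def credible_ball_coverage(mu0, alpha, p, n, tau=1.0, gamma=0.05,
                           n_rep=2000, seed=0):
    """Monte Carlo estimate of the coverage (4.4) and of the radius r_{n,gamma}."""
    rng = np.random.default_rng(seed)
    mu0 = np.asarray(mu0, dtype=float)
    i = np.arange(1, mu0.size + 1, dtype=float)
    kappa = i ** (-p)                            # assumed exact polynomial decay
    lam = tau**2 * i ** (-1.0 - 2.0 * alpha)     # prior variances, cf. (3.5)
    denom = 1.0 + n * lam * kappa**2
    a = n * lam * kappa / denom                  # coordinates of the operator A
    s = lam / denom                              # eigenvalues of S_n

    # Radius: (1 - gamma)-quantile of ||W||_1 for W ~ N(0, S_n), cf. (4.3).
    w_sq = (s * rng.standard_normal((n_rep, mu0.size)) ** 2).sum(axis=1)
    radius = np.sqrt(np.quantile(w_sq, 1.0 - gamma))

    # Frequentist draws of the posterior mean AY under the truth mu0.
    y = kappa * mu0 + rng.standard_normal((n_rep, mu0.size)) / np.sqrt(n)
    dist = np.linalg.norm(a * y - mu0, axis=1)
    return float(np.mean(dist < radius)), float(radius)

i = np.arange(1, 501, dtype=float)
mu0 = i ** -1.6                                  # hypothetical true parameter
cover_rough, r_rough = credible_ball_coverage(mu0, alpha=0.5, p=1.0, n=10_000)
cover_smooth, r_smooth = credible_ball_coverage(mu0, alpha=3.0, p=1.0, n=10_000)
```

In this hypothetical setting the rough prior ($\alpha = 0.5$) should give coverage close to 1 and the smooth prior ($\alpha = 3$) coverage close to 0, in line with cases (i) and (iii) of the theorem.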

Inspection of the proofs shows that the lack of coverage in case of over- 
smoothing arises from a bias in the positioning of the posterior mean com- 
bined with a posterior spread that is smaller even than in the optimal case. 
In other words, the posterior is off mark, but believes it is very right. The 
message is that (too) smooth priors should be avoided; they lead to overcon- 
fident posteriors, which reflect the prior information rather than the data, 
even if the amount of information in the data increases indefinitely. 

Under- and correct smoothing give very conservative confidence regions 
(coverage equal to 1). However, (i) and (ii) also show that the credible ball 
has the same order of magnitude as a correct confidence ball ($1 > \tilde r_{n,\gamma} / r_{n,\gamma} \gg 0$),
so that the spread in the posterior does give the correct order
of uncertainty. This at first sight surprising phenomenon is caused by the 
fact that the posterior distribution concentrates near the boundary of a ball 
around its mean, and is not spread over the inside of the ball. The cover- 
age is 1, because this sphere is larger than the corresponding sphere of the 
frequentist distribution of AY, even though the two radii are of the same 
order. 

By Theorem 4.1 the optimal contraction rate is obtained (only) by a prior 
of the correct smoothness. Combining the two theorems leads to the con- 
clusion that priors that slightly undersmooth the truth might be preferable. 
They attain a nearly optimal rate of contraction and the spread of their 
posterior gives a reasonable sense of uncertainty. 

Scaling of the prior modifies these conclusions. The optimal scaling $\tau_n \asymp n^{(\alpha-\beta)/(1+2\beta+2p)}$
found in Theorem 4.1, possible if $\beta < 1 + 2\alpha + 2p$, is covered
in case (ii). This rescaling leads to a balancing of square bias, variance and 
spread, and to credible regions of the correct order of magnitude, although 
the precise (uniform) coverage can be any number in [0,1). Alternatively, 
bigger rescaling rates are covered in case (i) and lead to coverage 1. The 
optimal or slightly bigger rescaling rate seems the most sensible. It would 
be interesting to extend these results to data-dependent scaling. 

5. Recovering linear functionals of the parameter. We denote by $\Pi_n(\mu :
L\mu \in \cdot \mid Y)$ the posterior distribution of the linear functional $L\mu$, as described
in Proposition 3.2. A continuous, linear functional $L : H_1 \to \mathbb{R}$ can be identified
with an inner product $L\mu = \langle \mu, l \rangle_1$, for some $l \in H_1$, and hence with
a sequence $(l_i)$ in $\ell_2$ giving its coordinates in the eigenbasis $(e_i)$.

As shown in the proof of Proposition 3.2, for $L$ in the larger class of
$N(0, \Lambda)$-measurable linear functionals, the functional $L\Lambda^{1/2}$ is a continuous
linear map on $H_1$ and hence can be identified with an element of $H_1$.
For such a functional $L$ we define a sequence $(l_i)$ by $l_i = (L\Lambda^{1/2})_i / \sqrt{\lambda_i}$,
for $((L\Lambda^{1/2})_i)$ the coordinates of $L\Lambda^{1/2}$ in the eigenbasis. The assumption
that $L$ is a $N(0, \Lambda)$-measurable linear functional implies that $\sum_i l_i^2 \lambda_i < \infty$,
but $(l_i)$ need not be contained in $\ell_2$; if $(l_i) \in \ell_2$, then $L$ is continuous and
the definition of $(l_i)$ agrees with the definition in the preceding paragraph.

We measure the smoothness of the functional $L$ by the size of the coefficients
$l_i$, as $i \to \infty$. First we assume that the sequence is in $S^q$, for some $q$.

Theorem 5.1 (Contraction). If $\mu_0$, $(\lambda_i)$, $(\kappa_i)$ and
$(\tau_n)$ are as in Assumption 3.1 and the representer $(l_i)$ of the
$N(0, \Lambda)$-measurable linear functional $L$ is contained in $S^q$ for
$q \ge -\beta$, then
$E_{\mu_0} \Pi_n(\mu : |L\mu - L\mu_0| \ge M_n \varepsilon_n \mid Y) \to 0$,
for every sequence $M_n \to \infty$, where

$\varepsilon_n = (n\tau_n^2)^{-((\beta+q)/(1+2\alpha+2p)) \wedge 1} + \tau_n (n\tau_n^2)^{-((1/2+\alpha+q)/(1+2\alpha+2p)) \wedge (1/2)}.$

The rate is uniform over $\mu_0$ in balls in $S^\beta$. In particular:

(i) If $\tau_n \equiv 1$, then
$\varepsilon_n = n^{-(\beta \wedge (1/2+\alpha) + q)/(1+2\alpha+2p)} \vee n^{-1/2}$.

(ii) If $q \le p$ and $\beta + q \le 1 + 2\alpha + 2p$ and
$\tau_n \asymp n^{(1/2+\alpha-\beta)/(2\beta+2p)}$, then
$\varepsilon_n = n^{-(\beta+q)/(2\beta+2p)}$.

(iii) If $q \le p$ and $\beta + q > 1 + 2\alpha + 2p$, then
$\varepsilon_n \gg n^{-(\beta+q)/(2\beta+2p)}$ for every scaling $\tau_n$.

(iv) If $q \ge p$ and
$\tau_n \gtrsim n^{(1/2+\alpha-\tilde\beta+p-q)/(2\tilde\beta+2q)}$, where
$\tilde\beta = \beta \wedge (1 + 2\alpha + 2p - q)$, then
$\varepsilon_n = n^{-1/2}$.
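
The special cases of the theorem can be checked numerically by evaluating the
two terms of the rate directly. The following Python sketch is an illustration
only and not part of the original analysis: the function names
`contraction_rate` and `optimal_tau` are hypothetical, and the multiplicative
constants hidden in the rate are ignored.

```python
def contraction_rate(n, tau, alpha, beta, p, q):
    """Sum of the two terms defining eps_n in Theorem 5.1 (constants ignored)."""
    d = 1 + 2 * alpha + 2 * p          # common denominator 1 + 2*alpha + 2*p
    big_n = n * tau ** 2               # effective sample size n * tau_n^2
    bias_term = big_n ** (-min((beta + q) / d, 1.0))
    spread_term = tau * big_n ** (-min((0.5 + alpha + q) / d, 0.5))
    return bias_term + spread_term


def optimal_tau(n, alpha, beta, p):
    """Scaling of case (ii): tau_n = n^((1/2 + alpha - beta) / (2*beta + 2*p))."""
    return n ** ((0.5 + alpha - beta) / (2 * beta + 2 * p))


if __name__ == "__main__":
    n = 1e6
    alpha, beta, p, q = 2.0, 1.0, 1.0, 0.5   # q <= p and beta + q <= 1 + 2*alpha + 2*p
    tau = optimal_tau(n, alpha, beta, p)
    eps = contraction_rate(n, tau, alpha, beta, p, q)
    minimax = n ** (-(beta + q) / (2 * beta + 2 * p))
    print(eps / minimax)   # ratio of eps_n to n^{-(beta+q)/(2*beta+2*p)}
```

With these (arbitrarily chosen) values both terms equal $n^{-3/8}$, so the
scaled prior attains the rate $n^{-(\beta+q)/(2\beta+2p)}$ of case (ii) even
though the unscaled prior ($\alpha = 2$) is smoother than $\beta - 1/2$.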

If $q \ge p$, then the smoothness of the functional $L$ cancels the
ill-posedness of the operator $K$, and estimating $L\mu$ becomes a "regular"
problem with an $n^{-1/2}$ rate of convergence. Without scaling the prior
($\tau_n \equiv 1$), the posterior contracts at this rate [see (i) or (iv)]
if the prior is not too smooth ($\alpha \le \beta - 1/2 + q - p$). With
scaling, the rate is also attained, with any prior, provided the scaling
parameter $\tau_n$ does not tend to zero too fast [see (iv)]. Inspection of
the proof shows that priors that are too smooth, or scales that are too
small, create a bias that slows the rate.

If $q < p$, where we take $q$ the "biggest" value such that
$(l_i) \in S^q$, estimating $L\mu$ is still an inverse problem. The minimax
rate over a ball in the Sobolev space $S^\beta$ is known to be bounded above
by $n^{-(\beta+q)/(2\beta+2p)}$ (see [8, 9, 16]).

This rate is attained without scaling [see (i): $\tau_n \equiv 1$] if and
only if the prior smoothness $\alpha$ is equal to the true smoothness $\beta$
minus 1/2 ($\alpha + 1/2 = \beta$). An intuitive explanation for this
apparent mismatch of prior and truth is that regularity of the parameter in
the Sobolev scale ($\mu_0 \in S^\beta$) is not the appropriate type of
regularity for estimating a linear functional $L\mu$. For instance, the
difficulty of estimating a function at a point is determined by the
regularity in a neighborhood of the point, whereas the Sobolev scale measures
global regularity over the domain. The fact that a Sobolev space of order
$\beta$ embeds continuously in a Hölder space of regularity $\beta - 1/2$
might give a quantitative explanation of the "loss" in smoothness by 1/2 in
the special case that the eigenbasis is the Fourier basis. In our Bayesian
context we draw the conclusion that the prior must be adapted to the
inference problem if we want to obtain the optimal frequentist rate: for
estimating the global parameter, a good prior must match the truth
($\alpha = \beta$), but for estimating a linear functional a good prior must
consider a Sobolev truth of order $\beta$ as having regularity
$\alpha = \beta - 1/2$.

If the prior smoothness $\alpha$ is not $\beta - 1/2$, then the minimax rate
may still be attained by scaling the prior. As in the global problem, this is
possible only if the prior is not too rough
[$\beta + q \le 1 + 2\alpha + 2p$, cases (ii) and (iii)]. The optimal scaling
when this is possible [case (ii)] is the same as the optimal scaling for the
global problem [Theorem 4.1(ii)] after decreasing $\beta$ by 1/2. So the
"loss in regularity" persists in the scaling rate. Heuristically this seems
to imply that a simple data-based procedure to set the scaling, such as
empirical or hierarchical Bayes, cannot attain simultaneous optimality in
both the global and local senses.

In the application of the preceding theorem, the functional $L$, and hence
the sequence $(l_i)$, will be given. Naturally, we apply the theorem with $q$
equal to the largest value such that $(l_i) \in S^q$. Unfortunately, this
lacks precision for the sequences $(l_i)$ that decrease exactly at some
polynomial order: a sequence $l_i \asymp i^{-q-1/2}$ is in $S^{q'}$ for every
$q' < q$, but not in $S^q$. In the following theorem we consider these
sequences, and the slightly more general ones such that
$|l_i| \le i^{-q-1/2} S(i)$, for some slowly varying sequence $S(i)$. Recall
that $S : [0, \infty) \to \mathbb{R}$ is slowly varying if
$S(tx)/S(t) \to 1$ as $t \to \infty$, for every $x > 0$. [For these sequences
$(l_i) \in S^{q'}$ for every $q' < q$, $(l_i) \notin S^{q'}$ for $q' > q$,
and $(l_i) \in S^q$ if and only if $\sum_i S^2(i)/i < \infty$.]

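Theorem 5.2 (Contraction). If $(\lambda_i)$, $(\kappa_i)$ and $(\tau_n)$ are
as in Assumption 3.1 and the representer $(l_i)$ of the
$N(0, \Lambda)$-measurable linear functional $L$ satisfies
$|l_i| \le i^{-q-1/2} S(i)$ for a slowly varying function $S$ and
$q \ge -\beta$, then the result of Theorem 5.1 is valid with

(5.1)  $\varepsilon_n = (n\tau_n^2)^{-((\beta+q)/(1+2\alpha+2p)) \wedge 1} \gamma_n + \tau_n (n\tau_n^2)^{-((1/2+\alpha+q)/(1+2\alpha+2p)) \wedge (1/2)} \delta_n,$

where, for $\rho_n = (n\tau_n^2)^{1/(1+2\alpha+2p)}$,

$\gamma_n^2 = \begin{cases} S^2(\rho_n), & \text{if } \beta + q < 1 + 2\alpha + 2p, \\ \sum_{i \le \rho_n} S^2(i)/i, & \text{if } \beta + q = 1 + 2\alpha + 2p, \\ 1, & \text{if } \beta + q > 1 + 2\alpha + 2p, \end{cases}$

$\delta_n^2 = \begin{cases} S^2(\rho_n), & \text{if } q < p, \\ \sum_{i \le \rho_n} S^2(i)/i, & \text{if } q = p, \\ 1, & \text{if } q > p. \end{cases}$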

This has the same consequences as in Theorem 5.1, up to the addition of 
slowly varying terms. 

Because the posterior distribution for the linear functional $L\mu$ is the
one-dimensional normal distribution $N(LAY, s_n^2)$, the natural credible
interval for $L\mu$ has endpoints $LAY \pm z_{\gamma/2} s_n$, for $z_\gamma$
the (lower) standard normal $\gamma$-quantile. The coverage of this interval
is

$P_{\mu_0}(LAY + z_{\gamma/2} s_n \le L\mu_0 \le LAY - z_{\gamma/2} s_n),$

where $Y$ follows (1.1) with $\mu = \mu_0$. To obtain precise results
concerning coverage, we assume that $(l_i)$ behaves polynomially up to a
slowly varying term, first in the situation $q < p$ that estimating $L\mu$ is
an inverse problem. Let $\tilde\tau_n$ be the (optimal) scaling $\tau_n$ that
equates the two terms in the right-hand side of (5.1). This satisfies
$\tilde\tau_n \asymp n^{(1/2+\alpha-\tilde\beta)/(2\tilde\beta+2p)} \eta_n$,
for a slowly varying factor $\eta_n$, where
$\tilde\beta = \beta \wedge (1 + 2\alpha + 2p - q)$.

Theorem 5.3 (Credibility). Let $\mu_0$, $(\lambda_i)$, $(\kappa_i)$ and
$(\tau_n)$ be as in Assumption 3.1, and let $|l_i| = i^{-q-1/2} S(i)$ for
$q < p$ and a slowly varying function $S$. Then the asymptotic coverage of
the interval $LAY \pm z_{\gamma/2} s_n$ is:

(i) in $(1 - \gamma, 1)$, uniformly in $\mu_0$ such that
$\|\mu_0\|_\beta \le 1$ if $\tau_n \gg \tilde\tau_n$.

(ii) in $(1 - \gamma, 1)$, for every $\mu_0 \in S^\beta$, if
$\tau_n \asymp \tilde\tau_n$ and $\beta + q < 1 + 2\alpha + 2p$; in $(0, c)$,
along some $\mu_0^n$ with $\sup_n \|\mu_0^n\|_\beta < \infty$ if
$\tau_n \asymp \tilde\tau_n$ [any $c \in (0, 1)$].

(iii) 0, along some $\mu_0^n$ with $\sup_n \|\mu_0^n\|_\beta < \infty$ if
$\tau_n \ll \tilde\tau_n$.

In case (iii) the sequence $\mu_0^n$ can be taken a fixed element $\mu_0$ in
$S^\beta$ if $\tau_n \le n^{-\delta} \tilde\tau_n$ for some $\delta > 0$.

Furthermore, if $\tau_n \equiv 1$, then the coverage takes the form as in
(i), (ii) and (iii) if $\alpha < \beta - 1/2$, $\alpha = \beta - 1/2$, and
$\alpha > \beta - 1/2$, respectively, where in case (iii) the sequence
$\mu_0^n$ can be taken a fixed element.

Similarly, as in the nonparametric problem, oversmoothing leads to coverage
0, while undersmoothing gives conservative intervals. Without scaling the
cut-off for under- or oversmoothing is at $\alpha = \beta - 1/2$; with
scaling the cut-off for the scaling rate is at the optimal rate
$\tilde\tau_n$.
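
The dependence of the coverage on the bias and on the two spreads can be made
concrete. Since $LAY$ is normally distributed around $LAK\mu_0$ with some
standard deviation $t_n$ under the model, the coverage of
$LAY \pm z_{\gamma/2} s_n$ is a difference of two normal distribution
function values. The sketch below is illustrative only; the function name
`interval_coverage` and its arguments are hypothetical, and the numerical
inputs are arbitrary rather than derived from a specific prior.

```python
from statistics import NormalDist


def interval_coverage(bias, t_n, s_n, gamma=0.05):
    """Coverage of LAY +/- z_{gamma/2} * s_n when LAY - L(mu_0) ~ N(bias, t_n^2).

    bias : LAK(mu_0) - L(mu_0), the bias of the posterior mean
    t_n  : frequentist standard deviation of LAY
    s_n  : posterior standard deviation (root of the posterior spread)
    """
    std = NormalDist()
    half_width = -std.inv_cdf(gamma / 2) * s_n   # z_{gamma/2} is negative
    upper = (half_width - bias) / t_n
    lower = (-half_width - bias) / t_n
    return std.cdf(upper) - std.cdf(lower)


if __name__ == "__main__":
    print(interval_coverage(0.0, 1.0, 1.0))    # no bias, s_n = t_n: nominal level
    print(interval_coverage(0.0, 1.0, 1.5))    # no bias, s_n > t_n: conservative
    print(interval_coverage(10.0, 1.0, 1.0))   # bias dominates: coverage near 0
```

The three calls mimic, in order, the exact, the undersmoothing and the
oversmoothing regimes: a negligible bias with $s_n > t_n$ gives coverage
strictly between $1 - \gamma$ and 1, while a bias much larger than $t_n$
drives the coverage to 0.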

The conservativeness in the case of undersmoothing is less extreme for
functionals than for the full parameter, as the coverage is strictly between
the credibility level $1 - \gamma$ and 1. The general message is the same:
oversmoothing is disastrous for the interpretation of credible bands, whereas
undersmoothing gives bands that at least have the correct order of magnitude,
in the sense that their width is of the same order as the variance of the
posterior mean (see the proof). Too much undersmoothing is also undesirable,
as it leads to very wide confidence bands, and may cause that
$\sum_i l_i^2 \lambda_i$ is no longer finite (see measurability property).

The results (i) and (ii) are the same for every $q < p$, even if
$\tau_n \equiv 1$. Closer inspection would reveal that for a given $\mu_0$
the exact coverage depends on $q$ [and $S(i)$] in a complicated way.

If $q \ge p$, then the smoothness of the functional $L$ compensates the lack
of smoothness of $K^{-1}$, and estimating $L\mu$ is not a true inverse
problem.
This drastically changes the performance of credible intervals. Although 
oversmoothing again destroys their coverage, credible intervals are exact 
confidence sets if the prior is not too smooth. We formulate this in terms of 
a Bernstein-von Mises theorem. 

The Bernstein-von Mises theorem for parametric models asserts that the 
posterior distribution approaches a normal distribution centered at an effi- 
cient estimator of the parameter and with variance equal to its asymptotic 
variance. It is the ultimate link between Bayesian and frequentist procedures. 
There is no version of this theorem for infinite-dimensional parameters [12], 
but the theorem may hold for "smooth" finite-dimensional projections, such 
as the linear functional $L\mu$ (see [5]).

In the present situation the posterior distribution of $L\mu$ is already
normal by the normality of the model and the prior: it is a
$N(LAY, s_n^2)$-distribution by Proposition 3.2. To speak of a Bernstein-von
Mises theorem, we also require the following:

(i) That the (root of the) spread $s_n$ of the posterior distribution is
asymptotically equivalent to the standard deviation $t_n$ of the centering
variable $LAY$.

(ii) That the sequence $(LAY - L\mu_0)/t_n$ tends in distribution to a
standard normal distribution.

(iii) That the centering $LAY$ is an asymptotically efficient estimator
of $L\mu$.

We shall show that (i) happens if and only if the functional $L$ cancels the
ill-posedness of the operator $K$, that is, if $q \ge p$ in Theorem 5.2.
Interestingly, the rate of convergence $t_n$ must be $n^{-1/2}$ up to a
slowly varying factor in this case, but it could be strictly slower than
$n^{-1/2}$ by a slowly varying factor increasing to infinity.

Because $LAY$ is normally distributed by the normality of the model,
assertion (ii) is equivalent to saying that its bias tends to zero faster
than $t_n$. This happens provided the prior does not oversmooth the truth too
much. For very smooth functionals ($q > p$) there is some extra "space" in
the cut-off for the smoothness, which (if the prior is not scaled:
$\tau_n \equiv 1$) is at $\alpha = \beta - 1/2 + q - p$, rather than at
$\alpha = \beta - 1/2$ as for the (global) inverse estimating problem. Thus,
the prior may be considerably smoother than the truth if the functional is
very smooth.

Let $\|\cdot\|$ denote the total variation norm between measures. Say that
$l \in \mathcal{R}^q$ if $|l_i| = i^{-q-1/2} S(i)$ for a slowly varying
function $S$. Write

$B_n = \sup_{\|\mu\|_\beta \lesssim 1} |LAK\mu - L\mu|$

for the maximal bias of $LAY$ over a ball in the Sobolev space $S^\beta$.
Finally, let $\tilde\tau_n$ be the (optimal) scaling $\tau_n$ in that it
equates the two terms in the right-hand side of (5.1).

Theorem 5.4 (Bernstein-von Mises). Let $\mu_0$, $(\lambda_i)$, and
$(\kappa_i)$ be as in Assumption 3.1, and let $l$ be the representer of the
$N(0, \Lambda)$-measurable linear functional $L$:

(i) If $l \in S^p$, then $s_n/t_n \to 1$; in this case
$n t_n^2 \to \sum_i l_i^2/\kappa_i^2$. If $l \in \mathcal{R}^q$, then
$s_n/t_n \to 1$ if and only if $q \ge p$; in this case $n \mapsto n t_n^2$ is
slowly varying.

(ii) If $l \in S^q$ for $q \ge p$, then $B_n = o(t_n)$ if either
$\tau_n \gg n^{(\alpha+1/2-\beta)/(2\beta+2q)}$ or ($\tau_n \equiv 1$ and
$\alpha < \beta - 1/2 + q - p$). If $l \in \mathcal{R}^q$ for $q \ge p$, then
$B_n = o(t_n)$ if ($\tau_n \gg \tilde\tau_n$) or ($\tau_n \equiv 1$ and
$\alpha < \beta - 1/2 + q - p$) or ($q = p$, $\tau_n \equiv 1$ and
$\alpha = \beta - 1/2 + q - p$) or [$q > p$, $\tau_n \equiv 1$ and
$\alpha = \beta - 1/2 + q - p$ and $S(i) \to 0$ as $i \to \infty$].

(iii) If $l \in S^p$ or $l \in \mathcal{R}^p$ and $B_n = o(t_n)$, then
$E_{\mu_0} \|\Pi_n(L\mu \in \cdot \mid Y) - N(LAY, t_n^2)\| \to 0$ and
$(LAY - L\mu_0)/t_n$ converges under $\mu_0$ in distribution to a standard
normal distribution, uniformly in $\|\mu_0\|_\beta \lesssim 1$. If
$l \in S^p$, then this is also true with $LAY$ and $t_n^2$ replaced by
$\sum_i Y_i l_i/\kappa_i$ and its variance $n^{-1} \sum_i l_i^2/\kappa_i^2$.

In both cases (iii), the asymptotic coverage of the credible interval
$LAY \pm z_{\gamma/2} s_n$ is $1 - \gamma$, uniformly in
$\|\mu_0\|_\beta \lesssim 1$. Finally, if the conditions under (ii) fail,
then there exists $\mu_0^n$ with $\sup_n \|\mu_0^n\|_\beta < \infty$ along
which the coverage tends to an arbitrarily low value.

The observation $Y$ in (1.1) can be viewed as a reduction by sufficiency of
a random sample of size $n$ from the distribution $N(K\mu, I)$. Therefore,
the model fits in the framework of i.i.d. observations, and "asymptotic
efficiency" can be defined in the sense of semiparametric models discussed
in, for example, [2, 30] and [31]. Because the model is shift-equivariant, it
suffices to consider local efficiency at $\mu_0 = 0$. The one-dimensional
submodels $N(K(th), I)$ on the sample space $\mathbb{R}^{H_2}$, for
$t \in \mathbb{R}$ and a fixed "direction" $h \in H_1$, have likelihood
ratios

$\log \frac{dN(tKh, I)}{dN(0, I)}(Y) = t\,Y_{Kh} - \tfrac{1}{2} t^2 \|Kh\|_2^2.$






Thus, their score function at $t = 0$ is the $(Kh)$th coordinate of a single
observation $Y = (Y_h : h \in H_2)$, the score operator is the map
$K : H_1 \to L_2(N(0, I))$ given by $Kh(Y) = Y_{Kh}$, and the tangent space
is the range of $K$. [We denote the score operator by the same symbol $K$ as
in (1.1); if the observation $Y$ were realizable in $H_2$, and not just in
the bigger sample space $\mathbb{R}^{H_2}$, then $Y_{Kh}$ would correspond to
$\langle Y, Kh \rangle_2$ and, hence, the score would be exactly $Kh$ for the
operator in (1.1) after identifying $H_2$ and its dual space.] The adjoint of
the score operator restricted to the closure of the tangent space is the
operator $K^T : \overline{KH_1} \subset L_2(N(0, I)) \to H_1$ that satisfies
$K^T(Y_g) = K^T g$, where $K^T$ on the right is the adjoint of
$K : H_1 \to H_2$. The functional $L\mu = \langle l, \mu \rangle_1$ has
derivative $l$. Therefore, by [28] asymptotically regular sequences of
estimators exist, and the local asymptotic minimax bound for estimating
$L\mu$ is finite, if and only if $l$ is contained in the range of $K^T$.
Furthermore, the variance bound is $\|m\|_2^2$ for $m \in H_2$ such that
$K^T m = l$.

In our situation the range of $K^T$ is $S^p$, and if $l \in S^p$, then by
Theorem 5.4(iii) the variance of the posterior is asymptotically equivalent
to the variance bound and its centering can be taken equal to the estimator
$n^{-1} \sum_i Y_i l_i/\kappa_i$, which attains this variance bound. Thus,
the theorem gives a semiparametric Bernstein-von Mises theorem, satisfying
each of (i), (ii), (iii) in this case. If only $l \in \mathcal{R}^p$ and not
$l \in S^p$, the theorem still gives a Bernstein-von Mises type theorem, but
the rate of convergence is slower than $n^{-1/2}$, and the standard
efficiency theory does not apply.

6. Example: Volterra operator. The classical Volterra operator
$K : L^2[0, 1] \to L^2[0, 1]$ and its adjoint $K^T$ are given by

$K\mu(x) = \int_0^x \mu(s)\,ds, \qquad K^T\mu(x) = \int_x^1 \mu(s)\,ds.$

The resulting problem (1.1) can also be written in "signal in white noise"
form as follows: observe the process $(Y_t : t \in [0, 1])$ given by
$Y_t = \int_0^t \int_0^s \mu(u)\,du\,ds + n^{-1/2} W_t$, for a Brownian
motion $W$.

The eigenvalues, eigenfunctions of $K^T K$ and conjugate basis are given by
(see [17]), for $i = 1, 2, \ldots$,

$\kappa_i^2 = \frac{1}{(i - 1/2)^2 \pi^2}, \qquad e_i(x) = \sqrt{2}\cos((i - 1/2)\pi x), \qquad f_i(x) = \sqrt{2}\sin((i - 1/2)\pi x).$

The $(f_i)$ are the eigenfunctions of $KK^T$, relative to the same
eigenvalues, and $Ke_i = \kappa_i f_i$ and $K^T f_i = \kappa_i e_i$, for
every $i \in \mathbb{N}$.
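
The relations between the two bases can be verified numerically. The
following sketch is an illustration, not part of the original text; it
approximates the integrals defining the Volterra operator and its adjoint by
a composite trapezoidal rule, and the helper names (`integrate`, `volterra`,
`volterra_adjoint`) are hypothetical.

```python
import math


def kappa(i):
    """Singular value kappa_i = 1 / ((i - 1/2) * pi)."""
    return 1.0 / ((i - 0.5) * math.pi)


def e(i, x):
    """Eigenfunction e_i of K^T K."""
    return math.sqrt(2.0) * math.cos((i - 0.5) * math.pi * x)


def f(i, x):
    """Conjugate basis function f_i (eigenfunction of K K^T)."""
    return math.sqrt(2.0) * math.sin((i - 0.5) * math.pi * x)


def integrate(g, a, b, steps=2000):
    """Composite trapezoidal approximation of the integral of g over [a, b]."""
    h = (b - a) / steps
    total = 0.5 * (g(a) + g(b))
    for k in range(1, steps):
        total += g(a + k * h)
    return total * h


def volterra(g, x):
    """(K g)(x): integral of g over [0, x]."""
    return integrate(g, 0.0, x)


def volterra_adjoint(g, x):
    """(K^T g)(x): integral of g over [x, 1]."""
    return integrate(g, x, 1.0)


if __name__ == "__main__":
    i, x = 2, 0.3
    print(volterra(lambda s: e(i, s), x), kappa(i) * f(i, x))          # K e_i vs kappa_i f_i
    print(volterra_adjoint(lambda s: f(i, s), x), kappa(i) * e(i, x))  # K^T f_i vs kappa_i e_i
```

Each printed pair should agree up to the quadrature error, which reflects
$Ke_i = \kappa_i f_i$ and $K^T f_i = \kappa_i e_i$.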

To illustrate our results with simulated data, we start by choosing a true
function $\mu_0$, which we expand as $\mu_0 = \sum_i \mu_{0,i} e_i$ on the
basis $(e_i)$. The data are the function

$Y = K\mu_0 + \frac{1}{\sqrt{n}} Z = \sum_i \mu_{0,i} \kappa_i f_i + \frac{1}{\sqrt{n}} Z.$

It can be generated relative to the conjugate basis $(f_i)$ as a sequence of
independent Gaussian random variables $Y_1, Y_2, \ldots$ with
$Y_i \sim N(\mu_{0,i}\kappa_i, n^{-1})$, that is, with mean
$\mu_{0,i}\kappa_i$ and standard deviation $n^{-1/2}$. The posterior
distribution of $\mu$ is Gaussian with mean $AY$ and covariance operator
$S_n$, given in Proposition 3.1. Under Assumption 3.1 it can be represented
in terms of the coordinates $(\mu_i)$ of $\mu$ relative to the basis $(e_i)$
as (conditionally) independent Gaussian variables $\mu_1, \mu_2, \ldots$ with

$\mu_i \mid Y \sim N\Bigl(\frac{n\lambda_i\kappa_i Y_i}{1 + n\lambda_i\kappa_i^2}, \frac{\lambda_i}{1 + n\lambda_i\kappa_i^2}\Bigr).$

The (marginal) posterior distribution for the function $\mu$ at a point $x$
is obtained by expanding $\mu(x) = \sum_i \mu_i e_i(x)$, and applying the
framework of linear functionals $L\mu = \sum_i l_i \mu_i$ with
$l_i = e_i(x)$. This shows that

$\mu(x) \mid Y \sim N\Bigl(\sum_i \frac{n\lambda_i\kappa_i Y_i e_i(x)}{1 + n\lambda_i\kappa_i^2}, \sum_i \frac{\lambda_i e_i(x)^2}{1 + n\lambda_i\kappa_i^2}\Bigr).$

We obtained (marginal) posterior credible bands by computing for every x 
a central 95% interval in the normal distribution on the right-hand side. 
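
The construction of the bands can be sketched in a few lines of Python. This
is a minimal illustration under stated assumptions, not the code used for the
figures: the series are truncated at `n_terms` coefficients (an assumption
made here for computability), the prior is unscaled with
$\lambda_i = i^{-1-2\alpha}$, the true coefficients are
$i^{-3/2}\sin(i)$, and the function name `posterior_band` is hypothetical.

```python
import math
import random
from statistics import NormalDist


def posterior_band(n, alpha, xs, n_terms=200, level=0.95, seed=0):
    """Simulate data for the Volterra problem and return, for each x in xs,
    the true value, the posterior mean and a central pointwise credible interval."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + level / 2)      # e.g. about 1.96 for level 0.95
    truth = [0.0] * len(xs)
    means = [0.0] * len(xs)
    variances = [0.0] * len(xs)
    for i in range(1, n_terms + 1):
        kappa = 1.0 / ((i - 0.5) * math.pi)        # singular value of K
        lam = i ** (-1.0 - 2.0 * alpha)            # prior variance lambda_i
        mu0_i = i ** -1.5 * math.sin(i)            # true coefficient
        y_i = rng.gauss(mu0_i * kappa, n ** -0.5)  # noise standard deviation n^{-1/2}
        denom = 1.0 + n * lam * kappa ** 2
        post_mean_i = n * lam * kappa * y_i / denom
        post_var_i = lam / denom
        for k, x in enumerate(xs):
            e_ix = math.sqrt(2.0) * math.cos((i - 0.5) * math.pi * x)
            truth[k] += mu0_i * e_ix
            means[k] += post_mean_i * e_ix
            variances[k] += post_var_i * e_ix ** 2
    bands = [(m - z * math.sqrt(v), m + z * math.sqrt(v))
             for m, v in zip(means, variances)]
    return truth, means, bands


if __name__ == "__main__":
    xs = [0.1 * k for k in range(11)]
    truth, means, bands = posterior_band(n=1000, alpha=1.0, xs=xs)
    for x, t, m, (lo, hi) in zip(xs, truth, means, bands):
        print(f"x={x:.1f} truth={t:+.3f} mean={m:+.3f} band=[{lo:+.3f}, {hi:+.3f}]")
```

Note that the width of the band depends only on $n$, $\alpha$ and $x$, not on
the data; rerunning with a larger `alpha` gives narrower bands at every point
of the grid.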

Figure 1 illustrates these bands for $n = 1{,}000$. In every one of the 10
panels in the figure the black curve represents the function $\mu_0$, defined
by the coefficients $i^{-3/2}\sin(i)$ relative to $e_i$ ($\beta = 1$). The 10
panels represent 10 independent realizations of the data, yielding 10
different realizations of the posterior mean (the red curves) and the
posterior credible bands (the green curves). In the left five panels the
prior is given by $\lambda_i = i^{-2\alpha-1}$ with $\alpha = 1$, whereas in
the right panels the prior corresponds to $\alpha = 5$. Each of the 10 panels
also shows 20 realizations from the posterior distribution.

Clearly, the posterior mean is not estimating the true curve very well, even
for $n = 1{,}000$. This is mostly caused by the intrinsic difficulty of the
inverse problem: better estimation requires bigger sample size. A comparison
of the left and right panels shows that the rough prior ($\alpha = 1$) is
aware of the difficulty: it produces credible bands that in (almost) all
cases contain the true curve. On the other hand, the smooth prior
($\alpha = 5$) is overconfident; the spread of the posterior distribution
poorly reflects the imprecision of estimation.

Specifying a prior that is too smooth relative to the true curve yields a
posterior distribution which gives both a bad reconstruction and a misguided
sense of uncertainty. Our theoretical results show that the inaccurate
quantification of estimation error remains even as $n \to \infty$.

Fig. 1. Realizations of the posterior mean (red) and (marginal) posterior
credible bands (green), and 20 draws from the posterior (dashed curves). In
all ten panels $n = 1{,}000$ and $\beta = 1$. Left 5 panels: $\alpha = 1$;
right 5 panels: $\alpha = 5$. True curve (black) given by coefficients
$\mu_{0,i} = i^{-3/2}\sin(i)$.

The reconstruction, by the posterior mean or any other posterior quantiles,
will eventually converge to the true curve. However, specification of a too
smooth prior will slow down this convergence significantly. This is
illustrated in Figure 2. Every one of its 10 panels is similarly constructed
as before, but now with $n = 1{,}000$ and $n = 10^8$ for the five panels on
the left-hand and right-hand side, respectively, and with
$\alpha = 1/2, 1, 2, 3, 5$ for the five panels from top to bottom. At first
sight $\alpha = 1$ seems better (see the left column in Figure 2), but leads
to zero coverage because of the mismatch close to the bump (see the right
column), while $\alpha = 1/2$ captures the bump. For $n = 10^8$ the posterior
for this optimal prior has collapsed onto the true curve, whereas the smooth
posterior for $\alpha = 5$ still has major difficulty in recovering the bump
in the true curve (even though it "thinks" it has captured the correct curve,
the bands having collapsed to a single curve in the figure).

Fig. 2. Realizations of the posterior mean (red) and (marginal) posterior
credible bands (green), and 20 draws from the posterior (dashed curves). In
all ten panels $\beta = 1$. Left 5 panels: $n = 1{,}000$ and
$\alpha = 0.5, 1, 2, 3, 5$ (top to bottom); right 5 panels: $n = 10^8$ and
$\alpha = 0.5, 1, 2, 3, 5$ (top to bottom). True curve (black) given by
coefficients $\mu_{0,i} = i^{-3/2}\sin(i)$.



7. Proofs. 

7.1. Proof of Theorem 4.1. The second moment of a Gaussian distribution on
$H_1$ is equal to the square norm of its mean plus the trace of its
covariance operator. Because the posterior is Gaussian $N(AY, S_n)$, it
follows that

$\int \|\mu - \mu_0\|_1^2 \, d\Pi_n(\mu \mid Y) = \|AY - \mu_0\|_1^2 + \operatorname{tr}(S_n).$

By Markov's inequality, the left-hand side is an upper bound to
$M_n^2 \varepsilon_n^2 \, \Pi_n(\mu : \|\mu - \mu_0\|_1 \ge M_n \varepsilon_n \mid Y)$.
Therefore, it suffices to show that the expectation under $\mu_0$ of the
right-hand side of the display is bounded by a multiple of $\varepsilon_n^2$.
The expectation of the first term is the mean square error of the posterior
mean $AY$, and can be written as the sum
$\|AK\mu_0 - \mu_0\|_1^2 + n^{-1}\operatorname{tr}(AA^T)$ of its square bias
and "variance." The second term $\operatorname{tr}(S_n)$ is deterministic.
Under Assumption 3.1 the three quantities can be expressed in the
coefficients relative to the eigenbasis $(e_i)$ as

(7.1)  $\|AK\mu_0 - \mu_0\|_1^2 = \sum_i \frac{\mu_{0,i}^2}{(1 + n\lambda_i\kappa_i^2)^2} \asymp \sum_i \frac{\mu_{0,i}^2}{(1 + n\tau_n^2 i^{-1-2\alpha-2p})^2},$

(7.2)  $\frac{1}{n}\operatorname{tr}(AA^T) = \sum_i \frac{n\lambda_i^2\kappa_i^2}{(1 + n\lambda_i\kappa_i^2)^2} \asymp \sum_i \frac{n\tau_n^4 i^{-2-4\alpha-2p}}{(1 + n\tau_n^2 i^{-1-2\alpha-2p})^2},$

(7.3)  $\operatorname{tr}(S_n) = \sum_i \frac{\lambda_i}{1 + n\lambda_i\kappa_i^2} \asymp \sum_i \frac{\tau_n^2 i^{-1-2\alpha}}{1 + n\tau_n^2 i^{-1-2\alpha-2p}}.$

By Lemma 8.1 (applied with $q = \beta$, $t = 0$, $u = 1 + 2\alpha + 2p$,
$v = 2$ and $N = n\tau_n^2$), the first can be bounded by
$\|\mu_0\|_\beta^2 (n\tau_n^2)^{-(2\beta/(1+2\alpha+2p)) \wedge 2}$, which
accounts for the first term in the definition of $\varepsilon_n$. By
Lemma 8.2 [applied with $S(i) = 1$, $q = -1/2$, $t = 2 + 4\alpha + 2p$,
$u = 1 + 2\alpha + 2p$, $v = 2$, and $N = n\tau_n^2$], and again Lemma 8.2
[applied with $S(i) = 1$, $q = -1/2$, $t = 1 + 2\alpha$,
$u = 1 + 2\alpha + 2p$, $v = 1$ and $N = n\tau_n^2$], both the second and
third expressions are of the order of the square of the second term in the
definition of $\varepsilon_n$.
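
The order of the trace term can be checked numerically. The sketch below is
illustrative only: it assumes $\kappa_i = i^{-p}$ exactly (Assumption 3.1
only requires this up to constants), truncates the series at `n_terms`
coefficients, and uses the hypothetical function names
`trace_posterior_cov` and `spread_rate_squared`. If the stated order is
correct, the ratio of the series to
$\tau_n^2 (n\tau_n^2)^{-2\alpha/(1+2\alpha+2p)}$ should stabilize as $n$
grows.

```python
def trace_posterior_cov(n, alpha, p, tau=1.0, n_terms=100_000):
    """Truncated series for tr(S_n) = sum_i lambda_i / (1 + n lambda_i kappa_i^2),
    with lambda_i = tau^2 * i^(-1-2*alpha) and kappa_i^2 = i^(-2*p) (assumed exact)."""
    total = 0.0
    for i in range(1, n_terms + 1):
        lam = tau ** 2 * i ** (-1.0 - 2.0 * alpha)
        kappa_sq = i ** (-2.0 * p)
        total += lam / (1.0 + n * lam * kappa_sq)
    return total


def spread_rate_squared(n, alpha, p, tau=1.0):
    """Square of the second term of eps_n in Theorem 4.1."""
    return tau ** 2 * (n * tau ** 2) ** (-2.0 * alpha / (1.0 + 2.0 * alpha + 2.0 * p))


if __name__ == "__main__":
    for n in (1e4, 1e6):
        ratio = trace_posterior_cov(n, 1.0, 1.0) / spread_rate_squared(n, 1.0, 1.0)
        print(n, ratio)   # the ratio should be of the same size for both n
```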

The consequences (i) and (ii) follow by verification after substitution of
$\tau_n$ as given. To prove consequence (iii), we note that the two terms in
the definition of $\varepsilon_n$ are decreasing and increasing in $\tau_n$,
respectively. Therefore, the maximum of these two terms is minimized with
respect to $\tau_n$ by equating the two terms. This minimum [assumed at
$\tau_n = n^{-(1+\alpha+2p)/(3+4\alpha+6p)}$] is much bigger than
$n^{-\beta/(1+2\beta+2p)}$ if $\beta > 1 + 2\alpha + 2p$.

7.2. Proof of Theorem 5.1. By Proposition 3.2 the posterior distribution is
$N(LAY, s_n^2)$, and, hence, similarly as in the proof of Theorem 4.1, it
suffices to show that

$E_{\mu_0} |LAY - L\mu_0|^2 + s_n^2 = |LAK\mu_0 - L\mu_0|^2 + \frac{1}{n}\|LA\|^2 + s_n^2$

is bounded above by a multiple of $\varepsilon_n^2$. Under Assumption 3.1 the
expressions on the right can be written

(7.4)  $LAK\mu_0 - L\mu_0 = -\sum_i \frac{l_i \mu_{0,i}}{1 + n\lambda_i\kappa_i^2} \asymp -\sum_i \frac{l_i \mu_{0,i}}{1 + n\tau_n^2 i^{-1-2\alpha-2p}},$

(7.5)  $t_n^2 := \frac{1}{n}\|LA\|^2 = \sum_i \frac{l_i^2 \, n\lambda_i^2\kappa_i^2}{(1 + n\lambda_i\kappa_i^2)^2} \asymp n\tau_n^4 \sum_i \frac{l_i^2 \, i^{-2-4\alpha-2p}}{(1 + n\tau_n^2 i^{-1-2\alpha-2p})^2},$

(7.6)  $s_n^2 = \sum_i \frac{l_i^2 \lambda_i}{1 + n\lambda_i\kappa_i^2} \asymp \tau_n^2 \sum_i \frac{l_i^2 \, i^{-1-2\alpha}}{1 + n\tau_n^2 i^{-1-2\alpha-2p}}.$

By the Cauchy-Schwarz inequality the square of the bias (7.4) satisfies

(7.7)  $|LAK\mu_0 - L\mu_0|^2 \lesssim \|\mu_0\|_\beta^2 \sum_i \frac{l_i^2 \, i^{-2\beta}}{(1 + n\tau_n^2 i^{-1-2\alpha-2p})^2}.$

By Lemma 8.1 (applied with $q = q$, $t = 2\beta$, $u = 1 + 2\alpha + 2p$,
$v = 2$ and $N = n\tau_n^2$) the right-hand side of this display can be
further bounded by $\|\mu_0\|_\beta^2 \|l\|_q^2$ times the square of the
first term in the sum of two terms that defines $\varepsilon_n$. By Lemma 8.1
(applied with $q = q$, $t = 2 + 4\alpha + 2p$, $u = 1 + 2\alpha + 2p$,
$v = 2$ and $N = n\tau_n^2$) and again Lemma 8.1 (applied with $q = q$,
$t = 1 + 2\alpha$, $u = 1 + 2\alpha + 2p$, $v = 1$ and $N = n\tau_n^2$), the
right-hand sides of (7.5) and (7.6) are bounded above by $\|l\|_q^2$ times
the square of the second term in the definition of $\varepsilon_n$.

Consequences (i)-(iv) follow by substitution, and, in the case of (iii),
optimization over $\tau_n$.

7.3. Proof of Theorem 5.2. This follows the same lines as the proof of
Theorem 5.1, except that we use Lemma 8.2 (with $q = q$, $t = 2\beta$,
$u = 1 + 2\alpha + 2p$, $v = 2$ and $N = n\tau_n^2$) and Lemma 8.2 (with
$q = q$, $t = 2 + 4\alpha + 2p$, $u = 1 + 2\alpha + 2p$, $v = 2$ and
$N = n\tau_n^2$) and again Lemma 8.2 (with $q = q$, $t = 1 + 2\alpha$,
$u = 1 + 2\alpha + 2p$, $v = 1$ and $N = n\tau_n^2$) to bound the three terms
(7.5)-(7.7).

7.4. Proof of Theorem 4.2. Because the posterior distribution is
$N(AY, S_n)$, by Proposition 3.1, the radius $r_{n,\gamma}$ in (4.3)
satisfies $P(U_n < r_{n,\gamma}^2) = 1 - \gamma$, for $U_n$ a random variable
distributed as the square norm of an $N(0, S_n)$-variable. Under (1.1) the
variable $AY$ is $N(AK\mu_0, n^{-1}AA^T)$-distributed, and, thus, the
coverage (4.4) can be written as

(7.8)  $P(\|W_n + AK\mu_0 - \mu_0\|_1 \le r_{n,\gamma})$

for $W_n$ possessing a $N(0, n^{-1}AA^T)$-distribution. For ease of notation
let $V_n = \|W_n\|_1^2$.

The variables $U_n$ and $V_n$ can be represented as
$U_n = \sum_i s_{i,n} Z_i^2$ and $V_n = \sum_i t_{i,n} Z_i^2$, for
$Z_1, Z_2, \ldots$ independent standard normal variables, and $s_{i,n}$ and
$t_{i,n}$ the eigenvalues of $S_n$ and $n^{-1}AA^T$, respectively, which
satisfy

$s_{i,n} = \frac{\lambda_i}{1 + n\lambda_i\kappa_i^2} \asymp \frac{\tau_n^2 i^{-2\alpha-1}}{1 + n\tau_n^2 i^{-2\alpha-2p-1}},$

$t_{i,n} = \frac{n\lambda_i^2\kappa_i^2}{(1 + n\lambda_i\kappa_i^2)^2} \asymp \frac{n\tau_n^4 i^{-4\alpha-2p-2}}{(1 + n\tau_n^2 i^{-2\alpha-2p-1})^2},$

$s_{i,n} - t_{i,n} = \frac{\lambda_i}{(1 + n\lambda_i\kappa_i^2)^2} \asymp \frac{\tau_n^2 i^{-2\alpha-1}}{(1 + n\tau_n^2 i^{-2\alpha-2p-1})^2}.$

Therefore, by Lemma 8.2 (applied with $S \equiv 1$ and $q = -1/2$; always the
first case),

$EU_n = \sum_i s_{i,n} \asymp \tau_n^2 (n\tau_n^2)^{-2\alpha/(1+2\alpha+2p)},$

$E(U_n - V_n) = \sum_i (s_{i,n} - t_{i,n}) \asymp \tau_n^2 (n\tau_n^2)^{-2\alpha/(1+2\alpha+2p)},$

$\operatorname{var} U_n = 2\sum_i s_{i,n}^2 \asymp \tau_n^4 (n\tau_n^2)^{-(1+4\alpha)/(1+2\alpha+2p)},$

$\operatorname{var} V_n = 2\sum_i t_{i,n}^2 \asymp \tau_n^4 (n\tau_n^2)^{-(1+4\alpha)/(1+2\alpha+2p)}.$

We conclude that the standard deviations of $U_n$ and $V_n$ are negligible
relative to their means, and also relative to the difference $E(U_n - V_n)$
of their means. Because $U_n \ge V_n$, we conclude that the distributions of
$U_n$ and $V_n$ are asymptotically completely separated:
$P(V_n \le v_n \le U_n) \to 1$ for some $v_n$ [e.g.,
$v_n = E(U_n + V_n)/2$]. The numbers $r_{n,\gamma}^2$ are
$1 - \gamma$-quantiles of $U_n$, and, hence,
$P(V_n \le r_{n,\gamma}^2(1 + o(1))) \to 1$. Furthermore, it follows that

$r_{n,\gamma}^2 \asymp \tau_n^2 (n\tau_n^2)^{-2\alpha/(1+2\alpha+2p)} \asymp EU_n \asymp EV_n.$

The square norm of the bias $AK\mu_0 - \mu_0$ is given in (7.1), where it was
noted that

$B_n := \sup_{\|\mu_0\|_\beta \lesssim 1} \|AK\mu_0 - \mu_0\|_1 \asymp (n\tau_n^2)^{-(\beta/(1+2\alpha+2p)) \wedge 1}.$

The bias $B_n$ is decreasing in $\tau_n$, whereas $EU_n$ and
$\operatorname{var} U_n$ are increasing. The scaling rate
$\tilde\tau_n \asymp n^{(\alpha-\tilde\beta)/(1+2\tilde\beta+2p)}$ balances
the square bias $B_n^2$ with the variance $EV_n$ of the posterior mean, and
hence with $r_{n,\gamma}^2$.

Case (i). In this case $B_n \ll r_{n,\gamma}$. Hence,
$P(\|W_n + AK\mu_0 - \mu_0\|_1 \le r_{n,\gamma}) \ge P(\|W_n\|_1 \le r_{n,\gamma} - B_n) = P(V_n \le r_{n,\gamma}^2(1 + o(1))) \to 1$,
uniformly in the set of $\mu_0$ in the supremum defining $B_n$.

Case (iii). In this case $B_n \gg r_{n,\gamma}$. Hence,
$P(\|W_n + AK\mu_0^n - \mu_0^n\|_1 \le r_{n,\gamma}) \le P(\|W_n\|_1 \ge B_n - r_{n,\gamma}) \to 0$
for any sequence $\mu_0^n$ (nearly) attaining the supremum in the definition
of $B_n$. If $\tau_n \equiv 1$, then $B_n$ and $r_{n,\gamma}$ are both powers
of $1/n$ and, hence, $B_n \gg r_{n,\gamma}$ implies that
$B_n \gtrsim r_{n,\gamma} n^\delta$, for some $\delta > 0$. The preceding
argument then applies for a fixed $\mu_0$ of the form
$\mu_{0,i} \asymp i^{-1/2-\beta-\epsilon}$, for small $\epsilon > 0$, that
gives a bias that is much closer than $n^\delta$ to $B_n$.

Case (ii). In this case $B_n \asymp r_{n,\gamma}$. If
$\beta < 1 + 2\alpha + 2p$, then by the second assertion (first case) of
Lemma 8.1 the bias $\|AK\mu_0 - \mu_0\|_1$ at a fixed $\mu_0$ is of strictly
smaller order than the supremum $B_n$. The argument of (i) shows that the
asymptotic coverage then tends to 1.

Finally, we prove the existence of a sequence $\mu_0^n$ along which the
coverage is a given $c \in [0, 1)$. The coverage (7.8) with $\mu_0$ replaced
by $\mu_0^n$ tends to $c$ if, for $b_n = AK\mu_0^n - \mu_0^n$ and $z_c$ a
standard normal quantile,

(7.9)  $\frac{\|W_n + b_n\|_1^2 - E\|W_n + b_n\|_1^2}{\operatorname{sd}\|W_n + b_n\|_1^2} \rightsquigarrow N(0, 1),$

(7.10)  $\frac{r_{n,\gamma}^2 - E\|W_n + b_n\|_1^2}{\operatorname{sd}\|W_n + b_n\|_1^2} \to z_c.$

Because $W_n$ is mean-zero Gaussian, we have
$E\|W_n + b_n\|_1^2 = E\|W_n\|_1^2 + \|b_n\|_1^2$ and
$\operatorname{var}\|W_n + b_n\|_1^2 = \operatorname{var}\|W_n\|_1^2 + 4\operatorname{var}\langle W_n, b_n \rangle_1$.
Here $\|W_n\|_1^2 = V_n$ and the distribution of
$\langle W_n, b_n \rangle_1$ is zero-mean Gaussian with variance
$\langle b_n, n^{-1}AA^T b_n \rangle_1$. With $t_{i,n}$ the eigenvalues of
$n^{-1}AA^T$, display (7.10) can be translated in the coefficients
$(b_{n,i})$ of $b_n$ relative to the eigenbasis, as

(7.11)  $\frac{r_{n,\gamma}^2 - EV_n - \sum_i b_{n,i}^2}{\sqrt{\operatorname{var} V_n + 4\sum_i t_{i,n} b_{n,i}^2}} \to z_c.$

We choose $(b_{n,i})$ differently in the cases that
$\beta \le 1 + 2\alpha + 2p$ and $\beta > 1 + 2\alpha + 2p$, respectively. In
both cases the sequence has exactly one nonzero coordinate. We denote this
coordinate by $b_{n,i_n}$, and set, for numbers $d_n$ to be determined,

$b_{n,i_n}^2 = r_{n,\gamma}^2 - EV_n - d_n \operatorname{sd} V_n.$

Because $r_{n,\gamma}^2$, $EV_n$ and $r_{n,\gamma}^2 - EV_n$ are of the same
order of magnitude, and $\operatorname{sd} V_n$ is of strictly smaller order,
for bounded or slowly diverging $d_n$ the right-hand side of the preceding
display is equivalent to $(r_{n,\gamma}^2 - EV_n)(1 + o(1))$. Consequently,
the left-hand side of (7.11) is equivalent to

$\frac{d_n \operatorname{sd} V_n}{\sqrt{\operatorname{var} V_n + 4 t_{i_n,n}(r_{n,\gamma}^2 - EV_n)(1 + o(1))}}.$

The remainder of the argument is different in the two cases.

Case $\beta \le 1 + 2\alpha + 2p$. We choose
$i_n \asymp (n\tau_n^2)^{1/(1+2\alpha+2p)}$. It can be verified that
$t_{i_n,n}(r_{n,\gamma}^2 - EV_n)/\operatorname{var} V_n \asymp 1$.
Therefore, for $c \in [0, 1]$, there exists a bounded or slowly diverging
sequence $d_n$ such that the preceding display tends to $z_c$.
The bias $b_n$ results from a parameter $\mu_0^n$ such that
$b_{n,i} = (1 + n\lambda_i\kappa_i^2)^{-1}(\mu_0^n)_i$ for every $i$. Thus,
$\mu_0^n$ also has exactly one nonzero coordinate, and this is proportional
to the corresponding coordinate of $b_n$, by the definition of $i_n$. It
follows that

$\|\mu_0^n\|_\beta^2 \asymp i_n^{2\beta} b_{n,i_n}^2 \lesssim i_n^{2\beta} r_{n,\gamma}^2 \asymp (n\tau_n^2)^{2\beta/(1+2\alpha+2p)} \tau_n^2 (n\tau_n^2)^{-2\alpha/(1+2\alpha+2p)} \lesssim 1$

by the definition of $\tau_n$. It follows that
$\|\mu_0^n\|_\beta \lesssim 1$.

Case $\beta > 1 + 2\alpha + 2p$. We choose $i_n = 1$. In this case
$\tau_n \to 0$ and it can be verified that
$t_{1,n}(r_{n,\gamma}^2 - EV_n)/\operatorname{var} V_n \to 0$. Also,

$\|\mu_0^n\|_\beta^2 \asymp (1 + n\tau_n^2)^2 b_{n,1}^2 \lesssim (1 + n\tau_n^2)^2 EV_n.$

This is $O(1)$, because $\tau_n$ is chosen so that $EV_n$ is of the same
order as the square bias $B_n^2$, which is $(n\tau_n^2)^{-2}$ in this case.

It remains to prove the asymptotic normality (7.9). We can write

$\|W_n + b_n\|_1^2 - E\|W_n + b_n\|_1^2 = \sum_i t_{i,n}(Z_i^2 - 1) + 2 b_{n,i_n} \sqrt{t_{i_n,n}} \, Z_{i_n}.$

The second term is normal by construction. The first term has variance
$2\sum_i t_{i,n}^2$. With some effort it can be seen that

$\sup_i \frac{t_{i,n}^2}{\sum_j t_{j,n}^2} \to 0.$

Therefore, by a slight adaptation of the Lindeberg-Feller theorem (to
infinite sums), we have that $\sum_i t_{i,n}(Z_i^2 - 1)$ divided by its
standard deviation tends in distribution to the standard normal
distribution. Furthermore, the preceding display shows that this conclusion
does not change if the $i_n$th term is left out from the infinite sum. Thus,
the two terms converge jointly to asymptotically independent standard normal
variables, if scaled separately by their standard deviations. Then their
scaled sum is also asymptotically standard normally distributed.

7.5. Proof of Theorem 5.3. Under (1.1) the variable $LAY$ is
$N(LAK\mu_0, t_n^2)$-distributed, for $t_n^2$ given in (7.5). It follows that
the coverage can be written, with $W$ a standard normal variable,

(7.12)  $P(|W t_n + LAK\mu_0 - L\mu_0| \le -s_n z_{\gamma/2}).$

The bias $LAK\mu_0 - L\mu_0$ and posterior spread $s_n^2$ are expressed as a
series in (7.4) and (7.6).

In the proof of Theorem 5.2 $s_n$ and $t_n$ were seen to have the same order
of magnitude, given by the second term in $\varepsilon_n$ given in (5.1),
with a slowly varying term $\delta_n$ as given in the theorem,

(7.13)  $s_n \asymp t_n \asymp \tau_n (n\tau_n^2)^{-(1/2+\alpha+q)/(1+2\alpha+2p)} \delta_n.$

Furthermore, $t_n \le s_n$ for every $n$, as every term in the infinite
series (7.5) is $n\lambda_i\kappa_i^2/(1 + n\lambda_i\kappa_i^2) \le 1$ times
the corresponding term in (7.6).

Because $W$ is centered, the coverage (7.12) is largest if the bias
$LAK\mu_0 - L\mu_0$ is zero. It is then at least $1 - \gamma$, because
$t_n \le s_n$; remains strictly smaller than 1, because $t_n \asymp s_n$; and
tends to exactly $1 - \gamma$ iff $s_n/t_n \to 1$. By Theorem 5.4(i) the
latter is impossible if $q < p$. The analysis for nonzero $\mu_0$ depends
strongly on the size of the bias relative to $t_n$.

The supremum of the bias satisfies, for $\gamma_n$ the slowly varying term
given in Theorem 5.2,

(7.14)  $B_n := \sup_{\|\mu_0\|_\beta \lesssim 1} |LAK\mu_0 - L\mu_0| \asymp (n\tau_n^2)^{-((\beta+q)/(1+2\alpha+2p)) \wedge 1} \gamma_n.$

That the left-hand side of (7.14) is smaller than the right-hand side was
already shown in the proof of Theorem 5.2, with the help of Lemma 8.2. That
this upper bound is sharp follows by considering the sequence $\mu_0^n$
defined by, with $\tilde B_n$ the right-hand side of the preceding display,

$\mu_{0,i}^n = \frac{1}{\tilde B_n} \frac{i^{-2\beta} l_i}{1 + n\lambda_i\kappa_i^2}.$

[This is the sequence that gives equality in the application of the
Cauchy-Schwarz inequality to derive (7.7).] Using Lemma 8.2, it can be seen
that $\|\mu_0^n\|_\beta \lesssim 1$ and that the bias at $\mu_0^n$ is of the
order $\tilde B_n$.

By Lemma 8.3, the bias at a fixed $\mu_0 \in S^\beta$ is of strictly smaller
order than the supremum $B_n$ if $\beta + q < 1 + 2\alpha + 2p$.

The maximal bias $B_n$ is a decreasing function of the scaling parameter
$\tau_n$, while the standard deviation $t_n$ and root-spread $s_n$ increase
with $\tau_n$. The scaling rate $\tilde\tau_n$ in the statement of the
theorem balances $B_n$ with $s_n \asymp t_n$.

Case (i). If $\tau_n \gg \tilde\tau_n$, then $B_n \ll t_n$. Hence, the bias
$LAK\mu_0 - L\mu_0$ in (7.12) is negligible relative to $t_n \asymp s_n$,
uniformly in $\|\mu_0\|_\beta \lesssim 1$, and the coverage is asymptotic to
$P(|W t_n| \le -s_n z_{\gamma/2})$, which is asymptotically strictly between
$1 - \gamma$ and 1.

Case (iii). If $\tau_n \ll \tilde\tau_n$, then $B_n \gg t_n$. If
$b_n = LAK\mu_0^n - L\mu_0^n$ is the bias at a sequence $\mu_0^n$ that
(nearly) attains the supremum in the definition of $B_n$, then the coverage
at $\mu_0^n$ satisfies
$P(|W t_n + b_n| \le -s_n z_{\gamma/2}) \le P(|W t_n| \ge b_n - s_n |z_{\gamma/2}|) \to 0$,
as $b_n \asymp B_n \gg s_n$. By the same argument, the coverage also tends to
zero for a fixed $\mu_0$ in $S^\beta$ with bias
$b_n = LAK\mu_0 - L\mu_0 \gg t_n$. For this we choose
$\mu_{0,i} = i^{-\beta+q} l_i \tilde S(i)$ for a slowly varying function
$\tilde S$ such that $\sum_i S^2(i)\tilde S^2(i)/i < \infty$. The latter
condition ensures that $\|\mu_0\|_\beta < \infty$. By another application of
Lemma 8.2, the bias at $\mu_0$ is of the order [cf. (7.4)]

$\sum_i \frac{l_i \mu_{0,i}}{1 + n\tau_n^2 i^{-1-2\alpha-2p}} = \sum_i \frac{l_i^2 \tilde S(i) \, i^{-\beta+q}}{1 + n\tau_n^2 i^{-1-2\alpha-2p}} \asymp (n\tau_n^2)^{-((\beta+q)/(1+2\alpha+2p)) \wedge 1} \tilde\gamma_n^2,$

where, for $\rho_n = (n\tau_n^2)^{1/(1+2\alpha+2p)}$,

$\tilde\gamma_n^2 = \begin{cases} S^2(\rho_n)\tilde S(\rho_n), & \text{if } \beta + q < 1 + 2\alpha + 2p, \\ \sum_{i \le \rho_n} S^2(i)\tilde S(i)/i, & \text{if } \beta + q = 1 + 2\alpha + 2p, \\ 1, & \text{if } \beta + q > 1 + 2\alpha + 2p. \end{cases}$

Therefore, the bias at $\mu_0$ has the same form as the maximal bias $B_n$;
the difference is in the slowly varying factor $\tilde\gamma_n^2$. If
$\tau_n \le \tilde\tau_n n^{-\delta}$, then $B_n \gtrsim t_n n^{\delta'}$ for
some $\delta' > 0$ and, hence,
$b_n \asymp B_n \tilde\gamma_n^2/\gamma_n \gg t_n$.

Case (ii). If $\tau_n \asymp \tilde\tau_n$, then $B_n \asymp t_n$. If $b_n = LAK\mu_0^n - L\mu_0^n$ is again the bias at a sequence $\mu_0^n$ that nearly assumes the supremum in the definition of $B_n$, we have that $P(|W t_n + d b_n| \le -s_n z_{\gamma/2}) \le P(|W t_n| \ge d b_n - s_n |z_{\gamma/2}|)$ attains an arbitrarily small value if $d$ is chosen sufficiently large. This is the coverage at the sequence $d\mu_0^n$, which is bounded in $S^\beta$. On the other hand, the bias at a fixed $\mu_0 \in S^\beta$ is of strictly smaller order than the supremum $B_n$, and, hence, the coverage at a fixed $\mu_0$ is as in case (i).
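The three cases can be illustrated numerically. The following is a minimal sketch, not part of the proof: it evaluates the coverage probability $P(|W t_n + b_n| \le -s_n z_{\gamma/2})$ for a standard normal $W$, taking $z_{\gamma/2}$ to be the lower $\gamma/2$-quantile (so that $-s_n z_{\gamma/2} > 0$). The function name `coverage` and the numerical values of $t_n$, $s_n$ and $b_n$ are hypothetical and chosen only to mimic negligible, comparable and dominating bias.

```python
from statistics import NormalDist


def coverage(t_n, s_n, b_n, gamma=0.05):
    """Return P(|W * t_n + b_n| <= -s_n * z_{gamma/2}) for W ~ N(0, 1)."""
    std = NormalDist()
    # z_{gamma/2} is the lower gamma/2 quantile, hence negative.
    half_width = -s_n * std.inv_cdf(gamma / 2)
    upper = std.cdf((half_width - b_n) / t_n)
    lower = std.cdf((-half_width - b_n) / t_n)
    return upper - lower


# Hypothetical values (assumptions for illustration only).
t_n, s_n = 1.0, 1.5
for b_n in (0.0, 1.0, 20.0):   # negligible, comparable, dominating bias
    print(b_n, coverage(t_n, s_n, b_n))
```

With zero bias and $s_n > t_n$ the value lies strictly between $1 - \gamma$ and 1, and it decreases toward zero as the bias grows relative to $s_n$.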

If the scaling rate is fixed to $\tau_n \equiv 1$, then it can be checked from (7.13) and (7.14) that $B_n \ll t_n$, $B_n \asymp t_n$ and $B_n \gg t_n$ in the three cases $\alpha < \beta - 1/2$, $\alpha = \beta - 1/2$ and $\alpha > \beta - 1/2$, respectively. In the first and third cases the maximal bias and the spread differ by more than a polynomial term $n^{\delta}$; in the second case it must be noted that the slowly varying terms $\gamma_n$ and $\delta_n$ are equal [to $S(\rho_n)$]. It follows that the preceding analysis (i), (ii), (iii) extends to this situation.



7.6. Proof of Theorem 5.4. (i) The two quantities $s_n$ and $t_n$ are given as series in (7.6) and (7.5). Every term in the series (7.5) is $n\lambda_i\kappa_i^2/(1 + n\lambda_i\kappa_i^2) \le 1$ times the corresponding term in the series (7.6). Therefore, $s_n/t_n \to 1$ if and only if the series are determined by the terms for which these numbers are "close to" 1, that is, $n\lambda_i\kappa_i^2$ is large. More precisely, we show below that $s_n/t_n \to 1$ if and only if, for every $c > 0$,

\[ \sum_{n\lambda_i\kappa_i^2 \le c} \frac{l_i^2 \lambda_i}{1 + n\lambda_i\kappa_i^2} = o\Bigl( \sum_i \frac{l_i^2 \lambda_i}{1 + n\lambda_i\kappa_i^2} \Bigr). \tag{7.15} \]

If $l \in S^p$, then the series on the left is as in Lemma 8.1 with $q = p$, $u = 1 + 2\alpha + 2p$, $v = 1$, $N = n\tau_n^2$ and $t = 1 + 2\alpha$. Hence, $(t + 2q)/u \ge v$, and the display follows from the final assertion of the lemma. If $l_i = i^{-q-1/2} S(i)$ for a slowly varying function $S$, then the series is as in Lemma 8.2, with the same parameters, and by the last statement of the lemma the display is true if and only if $(t + 2q)/u \ge v$, that is, $q \ge p$.

To prove that (7.15) holds iff $s_n/t_n \to 1$, write $s_n^2 = A_n + B_n$, for $A_n$ and $B_n$ the sums over the terms in (7.6) with $n\lambda_i\kappa_i^2 > c$ and $n\lambda_i\kappa_i^2 \le c$, respectively, and, similarly, $t_n^2 = C_n + D_n$. Then

\[ \frac{D_n}{B_n} \le \frac{c}{1 + c} \le \frac{C_n}{A_n} \le 1. \]

It follows that

\[ \frac{t_n^2}{s_n^2} = \frac{C_n + D_n}{A_n + B_n} = \frac{C_n/A_n + (D_n/B_n)(B_n/A_n)}{1 + B_n/A_n} \le \frac{1 + c/(1 + c)\,(B_n/A_n)}{1 + B_n/A_n}. \]

Because $x \mapsto (1 + rx)/(1 + x)$ is strictly decreasing from 1 at $x = 0$ to $r < 1$ at $x = \infty$ (if $0 < r < 1$), the right-hand side of the equation is asymptotically 1 if and only if $B_n/A_n \to 0$, and otherwise its liminf is strictly smaller. Thus, $t_n/s_n \to 1$ implies that $B_n/A_n \to 0$. Second,

\[ \frac{t_n^2}{s_n^2} \ge \frac{C_n}{A_n + B_n} = \frac{C_n/A_n}{1 + B_n/A_n} \ge \frac{c/(1 + c)}{1 + B_n/A_n}. \]

It follows that $\liminf t_n^2/s_n^2 \ge c/(1 + c)$ if $B_n/A_n \to 0$. This being true for every $c > 0$ implies that $t_n/s_n \to 1$.
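The dichotomy between $q > p$ and $q < p$ can be checked numerically. The sketch below is an illustration under simplifying assumptions that are ours, not the paper's: it takes $\tau_n \equiv 1$, $\lambda_i = i^{-1-2\alpha}$, $\kappa_i = i^{-p}$ and $l_i = i^{-q-1/2}$ (that is, $S \equiv 1$), and truncates the series (7.6) and (7.5) at a finite number of terms. The function name `spread_and_sd` and the parameter values are hypothetical.

```python
def spread_and_sd(n, alpha, p, q, terms=200_000):
    """Truncated series (7.6) and (7.5); returns (s_n^2, t_n^2).

    Assumes lambda_i = i^(-1-2*alpha), kappa_i = i^(-p) and
    l_i = i^(-q-1/2), i.e. tau_n = 1 and S identically 1.
    """
    s2 = 0.0
    t2 = 0.0
    for i in range(1, terms + 1):
        lam = i ** (-1.0 - 2.0 * alpha)
        x = n * lam * i ** (-2.0 * p)            # n * lambda_i * kappa_i^2
        term = i ** (-2.0 * q - 1.0) * lam / (1.0 + x)   # term of (7.6)
        s2 += term
        t2 += term * x / (1.0 + x)               # matching term of (7.5)
    return s2, t2


for q in (2.0, 0.0):                             # q > p and q < p, with p = 1
    s2, t2 = spread_and_sd(n=1e8, alpha=1.0, p=1.0, q=q)
    print(q, t2 / s2)
```

In this setup the ratio $t_n^2/s_n^2$ is close to 1 for $q > p$ and stays bounded away from 1 for $q < p$, in line with the statement proved above.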

(i) Second assertion. If $l \in S^p$, then we apply Lemma 8.1 with $q = p$, $t = 1 + 2\alpha$, $u = 1 + 2\alpha + 2p$, $v = 1$ and $N = n\tau_n^2$ to see that $s_n^2 \asymp \tau_n^2 (n\tau_n^2)^{-v} = n^{-1}$. Furthermore, the second assertion of the lemma with $(uv - t)/2 = p$ shows that $n s_n^2 \to \|l\|_p^2 = \sum_i l_i^2/\kappa_i^2$ in the case that $\kappa_i = i^{-p}$. The proof can be extended to cover the slightly more general sequence $(\kappa_i)$ in Assumption 3.1.

If $l \in R^q$, then we apply Lemma 8.2 with $q = p$, $t = 1 + 2\alpha$, $u = 1 + 2\alpha + 2p$, $v = 1$ and $N = n\tau_n^2$ to see that $s_n^2 \asymp n^{-1} \sum_{i \le N^{1/u}} S^2(i)/i$.

(ii) If $l \in S^q$, then the bias is bounded above in (7.7), and in the proof of Theorem 5.1 its supremum $B_n$ over $\|\mu_0\|_\beta \lesssim 1$ is seen to be bounded by $(n\tau_n^2)^{-((\beta+q)/(1+2\alpha+2p)) \wedge 1}$, the first term in the definition of $\varepsilon_n$ in the statement of this theorem. This upper bound is $o(n^{-1/2})$ iff the stated conditions hold. [Here we use that $S^2(N) \ll \sum_{i \le N} S^2(i)/i$ as $N \to \infty$, as noted in the proof of Lemma 8.2.]

The supremum of the bias $B_n$ in the case that $l \in R^q$ is given in (7.14). It was already seen to be $o(t_n)$ if $\tau_n \gg \tilde\tau_n$ in the proof of case (i) of Theorem 5.3. If $\tau_n \equiv 1$, we have that $B_n \asymp n^{-((\beta+q)/(1+2\alpha+2p)) \wedge 1} \gamma_n$, for $\gamma_n$ the slowly varying factor given in the statement of Theorem 5.2. Furthermore, we have $s_n \asymp t_n \asymp n^{-1/2} \delta_n$, for $\delta_n$ the slowly varying factor in the same statement. Under the present conditions, $\delta_n \asymp 1$ if $q > p$ and $\delta_n^2 \asymp \sum_{i \le \rho_n} S^2(i)/i$ if $q = p$. We can now verify that $B_n = o(t_n)$ if and only if the conditions as stated hold.

(iii) The total variation distance between two Gaussian distributions with the same expectation and standard deviations $s_n$ and $t_n$ tends to zero if and only if $s_n/t_n \to 1$. Similarly, the total variation distance between two Gaussians with the same standard deviation $s_n$ and means $\mu_n$ and $\nu_n$ tends to zero if and only if $\mu_n - \nu_n = o(s_n)$. Therefore, it suffices to show that $(LAY - \sum_i Y_i l_i/\kappa_i)/s_n \to 0$ if $l \in S^p$. Because the bias was already seen to be $o(t_n)$ and $s_n \asymp n^{-1/2}$ if $l \in S^p$, it suffices to show that $LAZ - \sum_i Z_i l_i/\kappa_i \to 0$. Under Assumption 3.1 this difference is equal to

\[ \sum_i \frac{\kappa_i \lambda_i l_i Z_i}{n^{-1} + \kappa_i^2 \lambda_i} - \sum_i \frac{Z_i l_i}{\kappa_i} = -\sum_i \frac{Z_i l_i}{\kappa_i} \Bigl( \frac{1}{1 + n\lambda_i\kappa_i^2} \Bigr). \]

If $\sum_i l_i^2/\kappa_i^2 < \infty$, then the variance of this expression is seen to tend to zero by dominated convergence.

The final assertion of the theorem follows along the lines of the proof of 
Theorem 5.3. 

8. Technical lemmas. 



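Lemma 8.1. For any $q \ge 0$, $t \ge -2q$, $u > 0$ and $v \ge 0$, as $N \to \infty$,

\[ \sup_{\|\xi\|_q \le 1} \sum_i \frac{\xi_i^2 i^{-t}}{(1 + N i^{-u})^v} \asymp N^{-((t+2q)/u) \wedge v}. \]

Moreover, for every fixed $\xi \in S^q$, as $N \to \infty$,

\[ N^{((t+2q)/u) \wedge v} \sum_i \frac{\xi_i^2 i^{-t}}{(1 + N i^{-u})^v} \to \begin{cases} 0, & \text{if } (t + 2q)/u < v, \\ \|\xi\|_{(uv-t)/2}^2, & \text{if } (t + 2q)/u \ge v. \end{cases} \]

The last assertion remains true if the sum is limited to the terms $i \le c N^{1/u}$, for any $c > 0$.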
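The second assertion of the lemma lends itself to a quick numerical sanity check. The sketch below is only an illustration: the choice $\xi_i = i^{-q-1}$ (which lies in $S^q$), the truncation of the infinite series, the helper names `scaled_sum` and `weighted_norm_sq`, and the parameter values are all assumptions made here.

```python
def xi_sq(i, q):
    """Squared coordinates of the hypothetical sequence xi_i = i^(-q-1)."""
    return i ** (-2.0 * q - 2.0)


def scaled_sum(N, q, t, u, v, terms=200_000):
    """N^{((t+2q)/u) ^ v} * sum_i xi_i^2 i^{-t} / (1 + N i^{-u})^v, truncated."""
    rate = min((t + 2.0 * q) / u, v)
    total = sum(
        xi_sq(i, q) * i ** (-float(t)) / (1.0 + N * i ** (-float(u))) ** v
        for i in range(1, terms + 1)
    )
    return N ** rate * total


def weighted_norm_sq(q, s, terms=200_000):
    """Truncated squared norm ||xi||_s^2 = sum_i xi_i^2 i^{2s}."""
    return sum(xi_sq(i, q) * i ** (2.0 * s) for i in range(1, terms + 1))


q, t, u = 1.0, 1.0, 2.0
# Case (t + 2q)/u >= v: the limit is the squared norm of order (uv - t)/2.
print(scaled_sum(1e8, q, t, u, v=1.0), weighted_norm_sq(q, (u * 1.0 - t) / 2.0))
# Case (t + 2q)/u < v: the scaled sum tends to zero.
print(scaled_sum(1e4, q, t, u, v=2.0), scaled_sum(1e8, q, t, u, v=2.0))
```

For these parameter values the first pair of numbers nearly coincide, while the second pair decreases with $N$.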

Proof. In the range $i \le N^{1/u}$ we have $N i^{-u} \le 1 + N i^{-u} \le 2 N i^{-u}$, while $1 \le 1 + N i^{-u} \le 2$ in the range $i > N^{1/u}$. Thus, deleting either the first or second term, we obtain

\[ \sum_{i \le N^{1/u}} \frac{\xi_i^2 i^{-t}}{(1 + N i^{-u})^v} \asymp \sum_{i \le N^{1/u}} \xi_i^2 i^{2q}\, \frac{i^{uv-t-2q}}{N^v} \le \|\xi\|_q^2\, N^{-((t+2q)/u) \wedge v}, \]
\[ \sum_{i > N^{1/u}} \frac{\xi_i^2 i^{-t}}{(1 + N i^{-u})^v} \asymp \sum_{i > N^{1/u}} \xi_i^2 i^{-t} \le N^{-(t+2q)/u} \sum_{i > N^{1/u}} \xi_i^2 i^{2q}. \]

The inequality in the first line follows by bounding $i$ in $i^{uv-t-2q}$ by $N^{1/u}$ if $uv - t - 2q > 0$, and by 1 otherwise. This proves the upper bound for the supremum.

The lower bound follows by considering the two sequences given by $\xi_i = i^{-q}$ for $i \sim N^{1/u}$ and $\xi_i = 0$ otherwise (showing that the supremum is bigger than $N^{-(t+2q)/u}$), and given by $\xi_1 = 1$ and $\xi_i = 0$ otherwise (showing that the supremum is bigger than $N^{-v}$).

The second line of the preceding display shows that the sum over the terms $i > N^{1/u}$ is $o(N^{-(t+2q)/u})$. Furthermore, the first line can be multiplied by $N^{(t+2q)/u}$ to obtain

\[ N^{(t+2q)/u} \sum_{i \le N^{1/u}} \frac{\xi_i^2 i^{-t}}{(1 + N i^{-u})^v} \asymp \sum_{i \le N^{1/u}} \xi_i^2 i^{2q} \Bigl( \frac{i}{N^{1/u}} \Bigr)^{uv-t-2q}. \]

If $(t + 2q)/u < v$, then $uv - t - 2q > 0$ and this tends to zero by dominated convergence. Also,

\[ N^v \sum_i \frac{\xi_i^2 i^{-t}}{(1 + N i^{-u})^v} = \sum_i \xi_i^2 i^{uv-t} \Bigl( \frac{N}{i^u + N} \Bigr)^v. \]

If $(t + 2q)/u \ge v$, then $q \ge (uv - t)/2$ and, hence, $\xi \in S^{(uv-t)/2}$, and the right-hand side tends to $\|\xi\|_{(uv-t)/2}^2$ by dominated convergence.

The final assertion needs to be proved only in the case that $(t + 2q)/u \ge v$, as in the other case the whole sum tends to 0. The sum over the terms $i > N^{1/u}$ was seen to be always $o(N^{-(t+2q)/u})$, which is $o(N^{-v})$ if $(t + 2q)/u \ge v$. The final assertion for $c = 1$ follows, because the sum over the terms $i \le N^{1/u}$ was seen to have the exact order $N^{-v}$ (if $\xi \ne 0$). For general $c$ the proof is analogous, or follows by scaling $N$. □

Lemma 8.2. For any $t, v \ge 0$, $u > 0$, and $(\xi_i)$ such that $|\xi_i| = i^{-q-1/2} S(i)$ for $q > -t/2$ and a slowly varying function $S : (0, \infty) \to (0, \infty)$, as $N \to \infty$,

\[ \sum_i \frac{\xi_i^2 i^{-t}}{(1 + N i^{-u})^v} \asymp \begin{cases} N^{-(t+2q)/u} S^2(N^{1/u}), & \text{if } (t + 2q)/u < v, \\ N^{-v} \sum_{i \le N^{1/u}} S^2(i)/i, & \text{if } (t + 2q)/u = v, \\ N^{-v}, & \text{if } (t + 2q)/u > v. \end{cases} \]

Moreover, for every $c > 0$, the sum on the left is asymptotically equivalent to the same sum restricted to the terms $i \le c N^{1/u}$ if and only if $(t + 2q)/u \ge v$.

Proof. As in the proof of the preceding lemma, we split the infinite series in the sum over the terms $i \le N^{1/u}$ and $i > N^{1/u}$. For the first part of the series

\[ \sum_{i \le N^{1/u}} \frac{\xi_i^2 i^{-t}}{(1 + N i^{-u})^v} \asymp \sum_{i \le N^{1/u}} S^2(i)\, \frac{i^{uv-t-2q-1}}{N^v}. \]

If $uv - t - 2q > 0$ [i.e., $(t + 2q)/u < v$], the right-hand side is of the order $N^{-(t+2q)/u} S^2(N^{1/u})$ by Theorem 1(b) on page 281 in [10], while if $uv - t - 2q < 0$, it is of the order $N^{-v}$ by the Lemma on page 280 in [10]. Finally, if $uv - t - 2q = 0$, then the right-hand side is identical to $N^{-v} \sum_{i \le N^{1/u}} S^2(i)/i$.

The other part of the infinite series satisfies, by Theorem 1(a) on page 281 in [10],

\[ \sum_{i > N^{1/u}} \frac{\xi_i^2 i^{-t}}{(1 + N i^{-u})^v} \asymp \sum_{i > N^{1/u}} S^2(i)\, i^{-t-2q-1} \asymp N^{-(t+2q)/u} S^2(N^{1/u}). \]

This is never bigger than the contribution of the first part of the sum, and of equal order if $(t + 2q)/u < v$. If $(t + 2q)/u > v$, then the leading polynomial term is strictly smaller than $N^{-v}$. If $(t + 2q)/u = v$, then the leading term is equal to $N^{-v}$, but the slowly varying part satisfies $S^2(N^{1/u}) \ll \sum_{i \le N^{1/u}} S^2(i)/i$, by Theorem 1(b) on page 281 in [10]. Therefore, in both cases the preceding display is negligible relative to the first part of the sum. This proves the final assertion of the lemma for $c = 1$. The proof for general $c > 0$ is analogous. □
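The three regimes of Lemma 8.2 can be explored numerically in the simplest case $S \equiv 1$ (a constant function is slowly varying). The following sketch is an illustration only: the function names `lemma82_sum` and `predicted_order`, the truncation level and the parameter values are assumptions made here, and the exact floating-point comparison of $(t + 2q)/u$ with $v$ is adequate only for the hand-picked values below.

```python
def lemma82_sum(N, q, t, u, v, terms=200_000):
    """Truncated sum_i xi_i^2 i^{-t} / (1 + N i^{-u})^v with xi_i = i^(-q-1/2)."""
    return sum(
        i ** (-2.0 * q - 1.0 - t) / (1.0 + N * i ** (-float(u))) ** v
        for i in range(1, terms + 1)
    )


def predicted_order(N, q, t, u, v):
    """Order stated in Lemma 8.2 for S identically 1."""
    r = (t + 2.0 * q) / u
    if r < v:
        return N ** (-r)
    if r == v:  # exact comparison: fine for the illustrative values only
        upper = int(N ** (1.0 / u))
        return N ** (-v) * sum(1.0 / i for i in range(1, upper + 1))
    return N ** (-v)


t, u, v = 1.0, 4.0, 1.0
for q in (1.0, 1.5, 2.0):       # (t + 2q)/u below, equal to, above v
    for N in (1e4, 1e8):
        ratio = lemma82_sum(N, q, t, u, v) / predicted_order(N, q, t, u, v)
        print(q, N, ratio)
```

If the stated orders are correct, the printed ratios should remain bounded away from zero and infinity as $N$ grows, which is what the relation $\asymp$ asserts.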

By the Cauchy-Schwarz inequality, for any $\mu \in S^{t/2}$,

\[ \Bigl| \sum_i \frac{\xi_i \mu_i}{1 + N i^{-u}} \Bigr|^2 \le \|\mu\|_{t/2}^2 \sum_i \frac{\xi_i^2 i^{-t}}{(1 + N i^{-u})^2}. \]

The preceding lemma gives the exact order of the right-hand side. The application of the Cauchy-Schwarz inequality is sharp, in that there is equality for some $\mu \in S^{t/2}$. However, this $\mu$ depends on $N$. For fixed $\mu \in S^{t/2}$ the left-hand side is strictly smaller than the right-hand side.
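The sharpness claim can be made explicit by a short worked step. This is a sketch in notation introduced here: the auxiliary sequences $a_i$, $b_i$ and the constant $c_N$ do not appear in the original text.

```latex
% Write the sum as an inner product of a_i = \mu_i i^{t/2} and
% b_i = \xi_i i^{-t/2} / (1 + N i^{-u}).
\[
\sum_i \frac{\xi_i \mu_i}{1 + N i^{-u}} = \sum_i a_i b_i ,
\qquad
\Bigl| \sum_i a_i b_i \Bigr|^2 \le \sum_i a_i^2 \sum_i b_i^2
= \|\mu\|_{t/2}^2 \sum_i \frac{\xi_i^2 i^{-t}}{(1 + N i^{-u})^2} .
\]
% Equality holds iff a_i = c_N b_i for all i, that is,
\[
\mu_i = c_N \, \frac{\xi_i \, i^{-t}}{1 + N i^{-u}} ,
\]
% a sequence that changes with N through the denominator.
```

Since the extremal sequence depends on $N$, no single fixed $\mu$ can attain the bound along the whole sequence $N \to \infty$.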

Lemma 8.3. For any $t, u \ge 0$, $\mu \in S^{t/2}$ and $(\xi_i)$ such that $|\xi_i| = i^{-q-1/2} S(i)$ for $0 < t + 2q < 2u$ and a slowly varying function $S : (0, \infty) \to (0, \infty)$, as $N \to \infty$,

\[ \Bigl| \sum_i \frac{\xi_i \mu_i}{1 + N i^{-u}} \Bigr| \ll N^{-(t+2q)/(2u)} S(N^{1/u}). \]

Proof. We split the series in two parts, and bound the denominator $1 + N i^{-u}$ by $N i^{-u}$ or 1. By the Cauchy-Schwarz inequality, for any $r > 0$,

\[ \Bigl| \sum_{i \le N^{1/u}} \frac{\xi_i \mu_i}{N i^{-u}} \Bigr|^2 \le \frac{1}{N^2} \sum_{i \le N^{1/u}} \frac{S^2(i)}{i^{1-r}} \sum_{i \le N^{1/u}} \mu_i^2 i^{2u-2q-r} \asymp \frac{1}{N^2}\, N^{r/u} S^2(N^{1/u})\, N^{(2u-2q-r-t)/u} \sum_{i \le N^{1/u}} \mu_i^2 i^t \Bigl( \frac{i}{N^{1/u}} \Bigr)^{2u-2q-r-t}, \]
\[ \Bigl| \sum_{i > N^{1/u}} \xi_i \mu_i \Bigr|^2 \le \sum_{i > N^{1/u}} \frac{S^2(i)}{i}\, i^{-2q} \sum_{i > N^{1/u}} \mu_i^2 \asymp S^2(N^{1/u})\, N^{-2q/u} \sum_{i > N^{1/u}} \mu_i^2. \]

The terms in the remaining series in the right-hand side of the first inequality are bounded by $\mu_i^2 i^t$ and tend to zero pointwise as $N \to \infty$ if $2u - 2q - r - t > 0$. If $t + 2q < 2u$, then there exists $r > 0$ such that the latter is true, and for this $r$ the sum tends to zero by the dominated convergence theorem. The other terms collect to $N^{-(t+2q)/u} S^2(N^{1/u})$. The sum in the right-hand side of the second inequality is bounded by $\sum_{i > N^{1/u}} \mu_i^2 i^t N^{-t/u} = o(N^{-t/u})$. □